Thread: ERROR: subtransaction logged without previous top-level txn record

ERROR: subtransaction logged without previous top-level txn record

From
"Hsu, John"
Date:
Hello,

I saw the previous thread but it wasn't in my inbox at the time, so I’m creating a new one sorry about that.
https://www.postgresql.org/message-id/20190516170434.masck6ehwg2kvbi2@alap3.anarazel.de

I’ve managed to reproduce the issue pretty consistently on REL9_6_STABLE on commit
959792087a10baf7f1b58408d28411109bcedb7a
 

OS version:
[ec2-user@ ... ~]$ uname -a
... 4.14.77-80.57.amzn2.x86_64 #1 SMP Tue Nov 6 21:18:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Postgres version:
postgres=# SELECT version();
                                                 version                                                  
----------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.13 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5), 64-bit
(1 row)

I’m on an EC2 m5.4xlarge if that matters.

Repro steps:

1. Create the following script
[ec2-user@ip-172-31-18-48 ~]$ cat ~/subbench.txt 
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
\set subcnt random(1, 800)
select * from pgbench(:aid, :bid, :tid, :delta, :subcnt);

2. Create the following function:

CREATE OR REPLACE FUNCTION pgbench(pAid int, pBid int, pTid int, delta int, subcnt int) returns int as $$
DECLARE
    abal int;
BEGIN
    FOR i in 1 .. subcnt LOOP
        BEGIN
            UPDATE pgbench_accounts SET abalance = abalance + delta WHERE aid = pAid;
            --subcnt := subcnt;
        EXCEPTION
        WHEN division_by_zero THEN
            subcnt := subcnt;
        END;
    END LOOP;
    abal := abalance FROM pgbench_accounts WHERE aid = pAid;
    return abal;
END; $$LANGUAGE 'plpgsql';

3. Create a few logical slots in the database

select pg_create_logical_replication_slot('test_slot_1', 'test_decoding');
select pg_create_logical_replication_slot('test_slot_2', 'test_decoding');
select pg_create_logical_replication_slot('test_slot_3', 'test_decoding');
...

4. Initialize pgbench
pgbench -i -d postgres

5. Load the data
pgbench -f subbench.txt -c 64 -j 64 -T 600 -P 1 -d postgres

6. Run pg_recvlogical with a timeout, it usually takes a few iterations (~7-9) before the error occurs

var=0
while true; do 
timeout 30 pg_recvlogical -d postgres --start --slot test_slot_1 -f /dev/null;
var=$((var+1))
echo "Sleeping 5s Time: $var";
sleep 5; 
done

pg_recvlogical -d postgres --start --slot test_slot_1 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR:  subtransaction logged without previous top-level
txnrecord
 

pg_recvlogical -d postgres --start --slot test_slot_2 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR:  subtransaction logged without previous top-level
txnrecord
 

pg_recvlogical -d postgres --start --slot test_slot_3 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR:  subtransaction logged without previous top-level
txnrecord
 
pg_recvlogical: disconnected; waiting 5 seconds to try again

What's interesting is that the confirmed_flush_lsn are all different from test_slot_1 --> test_slot_3

postgres=# select * from pg_replication_slots;
  slot_name   |    plugin     | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn
|confirmed_flush_lsn 
 

--------------+---------------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
 test_slot_1  | test_decoding | logical   |  13382 | postgres | f      |            |      |         1848 | 0/1C5BC5A0
|0/5488E468
 
 test_slot_2  | test_decoding | logical   |  13382 | postgres | f      |            |      |         1848 | 0/1C5BC5A0
|0/40E45EA0
 
 test_slot_3  | test_decoding | logical   |  13382 | postgres | f      |            |      |         1848 | 0/3F4B6AF8
|0/6BB3A990
 


Let me know if you require more info to repro.

Thanks!

John H








Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Hi,

Our customer also encountered this issue and I've looked into it. The problem is
reproduced well enough using the instructions in the previous message.

The check leading to this ERROR is too strict, it forbids legit behaviours. Say
we have in WAL

[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ]

- First xl_xact_assignment record is beyond reading, i.e. earlier
  restart_lsn, where ready snapshot will be taken from disk.
- After restart_lsn there is some change of a subxact.
- After that, there is second xl_xact_assignment (for another subxact)
  revealing relationship between top and first subxact, where this ERROR fires.

Such transaction won't be streamed because we hadn't seen it in full. It must be
finished before streaming will start, i.e. before confirmed_flush_lsn.

Of course, the easiest fix is to just throw the check out. However, supposing
that someone would probably want to relax it instead, I considered ways to
accomplish this. Something like 'if we are still in SNAPSHOT_FULL and xid is
before SnapBuildNextPhaseAt, just ignore xl_xact_assignment record, we haven't
seen such xact in full and definitely won't stream it.' That led to discovery of
another bug in the place which I had found suspicious long before.

Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the
snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
just complete snapshot (snapshot by itself in FULL state is just good as in
CONSISTENT), but also reorderbuffer filled with all currently running
xacts. This is painless for decoding sessions with existing slots because they
won't stream anything before confirmed_flush_lsn is reached anyway, at which
point all transactions which hadn't got into reorderbuffer would definitely
finish. However, new slots might be created too early, thus losing (not
decoding) parts of transactions committed after freshly created
confirmed_flush_lsn. This can happen under the following extremely unlucky
circumstances:
  - New slot creation reserves point in WAL since which it would read it
    (GetXLogInsertRecPtr);
  - It logs xl_running_xacts to start assembling a snapshot;
  - Running decoding session with another slot quickly reads this
    xl_running_xacts and serializes its snapshot;
  - New slot reads xl_running_xacts and picks this snapshot up, saying that it
    is ready to stream henceforth, though its reorderbuffer is empty.

Exact reproducing steps:

-- session 1
create table t (i int);
select pg_create_logical_replication_slot('slot_1', 'test_decoding');

-- session 2
begin;
insert into t values (1);

-- session 3, start slot creation
select pg_create_logical_replication_slot('slot_2', 'test_decoding');
-- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);

-- session 1
-- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
-- need to sleep; our snap will be immediately serialized there
SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');

-- continue slot_2 creation

-- session 2: insert some more and commit
insert into t values (1);
commit;

-- now this would find second insert, but not the first one
SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');


What we can do here? Initially I was like, ok, then let's get into FULL_SNAPSHOT
upon deserializing the snap and wait for all xacts finish as usual. However, to
my surprise I've found this impossible. That is, snapbuilder has no way to
enforce that we go into CONSISTENT only when we have seen all running xacts
completely without risk of skipping legit transactions. Specifically, after
deserializing FULL snapshot snapbuilder must iterate over WAL further until all
running xacts finish, as we must see with correct snapshots all changes of every
transaction we are going to stream. However, snapbuilder can't *immediately*
notice this point, because
 - Snapbuilder updates xmin (first running xact) by taking it from xl_running_xacts
   (c.f. SnapBuildProcessRunningXacts). Even if we guarantee that, for
   each possible WAL reading starting position, there is always an an
   xl_running_xacts records logged right before the earliest possible
   streaming point -- IOW, after all xacts which we can't stream had
   finished (which is currently true btw, as slot's advancement is
   considered only at xl_running_xacts) -- that would not be enough due
   to races around xl_running_xacts, i.e with WAL like
   [ <T1> <restart_lsn> <T1 commit> <confirmed_flush_lsn, xrx> <T2 commit> ]
   T2 might be skipped if T1 is shown as running in xl_running_xacts.
 - Tracking xmin manually by recoding commits is not only inefficient,
   it just not feasible because serialized snapshot is not full: it
   contains only committed catalog-modifying xacts. Thus, we can't
   distinguish non-catalog-modifying xact committed before serialized
   snapshot from not yet committed one.

Which means only code external to snapbuilder knows the earliest point suitable
for streaming; slot advancement machinery ensures that <restart_lsn,
confirmed_flush_lsn> pair is always good. So possible fix is the following: if
snapbuilder's user knows exact LSN since which streaming is safe (existing slot,
essentially), trust him and and switch into CONSISTENT state after deserializing
snapshot as before. OTOH, if he doesn't know it (new slot creation), go via
usual FULL -> CONSISTENT procedure; we might transition into CONSISTENT a bit
later than it became possible, but there is nothing bad about that.

First attached patch implements this. I don't particularly like it, but the only
alternative which I see is to rework slots advancement logic to make
<restart_lsn, confirmed_flush_lsn> pair such that there is always
xl_running_xacts before confirmed_flush_lsn which confirms all xacts running as
of restart_lsn have finished. This unnecessary complexity looks much worse.


As for the check in the topic, I nonetheless propose to remove it completely, as
in second attached patch. Saying for sure whether xact of some record
encountered after snapshot was deserialized can be streamed or not requires to
know nextXid (first not yet running xid) as of this snapshot's lsn -- all xids <
nextXid possibly hadn't been seen in full and are not subject to
decoding. However, generally we don't know nextXid which is taken from
xl_running_xacts; in particular snapshot can be serizalized/deserialized at
XLOG_END_OF_RECOVERY. Changing that for the sake of the check in question is not
worthwhile, so just throw it out instead of trying to relax.

In fact, I don't see what is so important about seeing the top xact first at
all. During COMMIT decoding we'll know all subxids anyway, so why care?


P.S. While digging this, I have noted that return values of
SnapBuildFindSnapshot seem pretty random to me. Basically returning 'true'
performs immediately 4 things:
 - update xmin
 - purge old xip entries
 - advance xmin of the slot
 - if CONSISTENT, advance lsn (earliest serialized snap)

The latter two make sense only after slot created or confirmed_flush_lsn
reached. The first two make sense even immediately after deserializing the
snapshot (because it is serialized *before* updating xmin and xip); generally,
always when committed xids are tracked. Then why cleanup is done when xmin
horizon is too low? Why it is not performed after restoring serialized snapshot?
On the whole, I find this not very important as all these operations are pretty
cheap and harmless if executed too early -- it would be simpler just do them
always.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From 86a03671c3bdbaee3dabe02a9a1f8e2e23098652 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Tue, 22 Oct 2019 19:02:14 +0300
Subject: [PATCH 1/2] Fix serialized snapshot usage for new logical slots.

Previously, snapbuilder entered into SNAPBUILD_CONSISTENT immediately after
deserializing the snapshot. Generally this is incorrect because
SNAPBUILD_CONSISTENT means not just complete snapshot, but also reorderbuffer
filled with all currently running xacts. This is painless for decoding sessions
with existing slots because they won't stream anything before
confirmed_flush_lsn is reached anyway, at which point all transactions which
hadn't got into reorderbuffer would definitely finish. However, new slots might
be created too early, thus losing (not decoding) parts of transactions committed
after freshly created confirmed_flush_lsn. This can happen under the following
extremely unlucky circumstances:
 - New slot creation reserves point in WAL since which it would read it
   (GetXLogInsertRecPtr);
 - It logs xl_running_xacts to start assembling a snapshot;
 - Running decoding session with another slot quickly reads this
   xl_running_xacts and serializes its snapshot;
 - New slot reads xl_running_xacts and picks this snapshot up, saying that it
   is ready to stream henceforth, though its reorderbuffer is empty.

It turns out, as comment to AllocateSnapshotBuilder explains, that snapbuilder
can't locate *earliest* possible point for streaming on its own. Thus, if
snapbuild users know it (decoding session for existing slot), we trust them and
switch after snapshot deserialization into SNAPBUILD_CONSISTENT as
previously. However, if it is not known (new slot creation), switch into
SNAPBUILD_FULL_SNAPSHOT and wait for all running xacts to finish as usual.
---
 src/backend/replication/logical/logical.c   |   2 +-
 src/backend/replication/logical/snapbuild.c | 141 ++++++++++++++++++++++------
 2 files changed, 115 insertions(+), 28 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index da265f5294..67ed52e56f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
     ReplicationSlotMarkDirty();
     ReplicationSlotSave();
 
-    ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
+    ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
                                  need_full_snapshot, false,
                                  read_page, prepare_write, do_write,
                                  update_progress);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0bd1d0f954..3ed178a4f8 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,7 +165,7 @@ struct SnapBuild
 
     /*
      * Don't replay commits from an LSN < this LSN. This can be set externally
-     * but it will also be advanced (never retreat) from within snapbuild.c.
+     * or established by snapbuild.c once consistent snapshot is assembled.
      */
     XLogRecPtr    start_decoding_at;
 
@@ -275,7 +275,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
-static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn, TransactionId nextXid);
 
 /*
  * Return TransactionId after which the next phase of initial snapshot
@@ -309,7 +309,42 @@ SnapBuildStartNextPhaseAt(SnapBuild *builder, TransactionId at)
  * Allocate a new snapshot builder.
  *
  * xmin_horizon is the xid >= which we can be sure no catalog rows have been
- * removed, start_lsn is the LSN >= we want to replay commits.
+ * removed.
+ * start_lsn is InvalidXLogRecPtr or LSN >= we want to replay commits.
+ *  InvalidXLogRecPtr is used during slot creation; snapbuild will assemble
+ *  consistent snapshot and (if continue decoding -- no core code does that
+ *  currently) stream all further commits.
+ *  If normal lsn is passed, caller *must* be sure that WAL reading starts
+ *  early enough to build the snapshot and pick up the first change of
+ *  earliest xact to stream. Slot creation and advancement machinery
+ *  guarantees that slot's <restart_lsn, confirmed_flush_lsn> pair always
+ *  satisfies this.
+ *
+ *    The condition is the caller's responsibility because in some cases
+ *    snapbuilder has no way to enforce this rule without risk of skipping legit
+ *    transactions. Specifically, after constructing (usually deserializing) FULL
+ *    snapshot snapbuilder must iterate over WAL further until all running xacts
+ *    from the snap finish, as we must see with correct snapshots all changes of
+ *    every transaction we are going to stream. However, snapbuild can't
+ *  *immediately* notice this point, because
+ *    - Snapbuilder updates xmin by taking it from xl_running_xacts
+ *      (c.f. SnapBuildProcessRunningXacts). Even if we guarantee that, for
+ *      each possible WAL reading starting position, there is always an an
+ *      xl_running_xacts records logged right before the earliest possible
+ *      streaming point -- IOW, after all xacts which we can't stream had
+ *      finished (which is currently true btw, as slot's advancement is
+ *      considered only at xl_running_xacts) -- that would not be enough due
+ *      to races around xl_running_xacts, i.e with WAL like
+ *      [ <T1> <restart_lsn> <T1 commit> <confirmed_flush_lsn, xrx> <T2 commit> ]
+ *      T2 might be skipped if T1 is shown as running in xl_running_xacts.
+ *      - Tracking xmin manually by recoding commits is not only inefficient,
+ *      it just not feasible because serialized snapshot is not full: it
+ *      contains only committed catalog-modifying xacts. Thus, we can't
+ *      distinguish non-catalog-modifying xact committed before serialized
+ *      snapshot from not yet committed one.
+ *  So, trust the caller: once given start_lsn is reached, it means we must
+ *  have reached the point where all further xacts can be streamed, FULL ->
+ *  CONSISTENT transition.
  */
 SnapBuild *
 AllocateSnapshotBuilder(ReorderBuffer *reorder,
@@ -408,7 +443,8 @@ SnapBuildCurrentState(SnapBuild *builder)
 bool
 SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
 {
-    return ptr < builder->start_decoding_at;
+    return XLogRecPtrIsInvalid(builder->start_decoding_at) ||
+        ptr < builder->start_decoding_at;
 }
 
 /*
@@ -945,16 +981,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
          TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder))))
     {
         /* ensure that only commits after this are getting replayed */
-        if (builder->start_decoding_at <= lsn)
-            builder->start_decoding_at = lsn + 1;
+        Assert(XLogRecPtrIsInvalid(builder->start_decoding_at)    ||
+               builder->start_decoding_at > lsn);
         return;
     }
 
     if (builder->state < SNAPBUILD_CONSISTENT)
     {
         /* ensure that only commits after this are getting replayed */
-        if (builder->start_decoding_at <= lsn)
-            builder->start_decoding_at = lsn + 1;
+        Assert(XLogRecPtrIsInvalid(builder->start_decoding_at)    ||
+               builder->start_decoding_at > lsn);
 
         /*
          * If building an exportable snapshot, force xid to be tracked, even
@@ -966,6 +1002,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
         }
     }
 
+    if (!XLogRecPtrIsInvalid(builder->start_decoding_at) &&
+        builder->start_decoding_at <= lsn)
+    {
+        /*
+         * We are going to stream this xact, so must already have fine
+         * snapshot.
+         */
+        Assert(builder->state == SNAPBUILD_CONSISTENT);
+    }
+
     for (nxact = 0; nxact < nsubxacts; nxact++)
     {
         TransactionId subxid = subxacts[nxact];
@@ -1250,10 +1296,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
      */
     if (running->oldestRunningXid == running->nextXid)
     {
-        if (builder->start_decoding_at == InvalidXLogRecPtr ||
-            builder->start_decoding_at <= lsn)
+        if (XLogRecPtrIsInvalid(builder->start_decoding_at))
             /* can decode everything after this */
             builder->start_decoding_at = lsn + 1;
+        Assert(builder->start_decoding_at >= lsn);
 
         /* As no transactions were running xmin/xmax can be trivially set. */
         builder->xmin = running->nextXid;    /* < are finished */
@@ -1275,9 +1321,9 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
     }
     /* b) valid on disk state and not building full snapshot */
     else if (!builder->building_full_snapshot &&
-             SnapBuildRestore(builder, lsn))
+             SnapBuildRestore(builder, lsn, running->nextXid))
     {
-        /* there won't be any state to cleanup */
+        /* there won't be much state to cleanup */
         return false;
     }
 
@@ -1358,6 +1404,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
     {
         builder->state = SNAPBUILD_CONSISTENT;
         SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
+        if (XLogRecPtrIsInvalid(builder->start_decoding_at))
+            /* can decode everything after this */
+            builder->start_decoding_at = lsn + 1;
+        Assert(builder->start_decoding_at >= lsn);
 
         ereport(LOG,
                 (errmsg("logical decoding found consistent point at %X/%X",
@@ -1471,8 +1521,8 @@ typedef struct SnapBuildOnDisk
 void
 SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn)
 {
-    if (builder->state < SNAPBUILD_CONSISTENT)
-        SnapBuildRestore(builder, lsn);
+    if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
+        SnapBuildRestore(builder, lsn, InvalidTransactionId);
     else
         SnapBuildSerialize(builder, lsn);
 }
@@ -1499,10 +1549,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
            builder->last_serialized_snapshot <= lsn);
 
     /*
-     * no point in serializing if we cannot continue to work immediately after
-     * restoring the snapshot
+     * No point in serializing if the snapshot is not complete.
+     * However, FULL snapshot is just as good as CONSISTENT; difference
+     * between these states is not snapshot property, but whether we have
+     * filled reorderbuffer with all currently running xacts.
      */
-    if (builder->state < SNAPBUILD_CONSISTENT)
+    if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
         return;
 
     /*
@@ -1688,10 +1740,15 @@ out:
 
 /*
  * Restore a snapshot into 'builder' if previously one has been stored at the
- * location indicated by 'lsn'. Returns true if successful, false otherwise.
+ * location indicated by 'lsn'.
+ * nextXid is first not yet running xid as of this lsn or InvalidTransactionId;
+ *   unless external code knows where it is safe to start streaming, we can
+ *   use serialized snapshot only if we are aware which xids finish we must
+ *   wait to be able to stream all further commits.
+ *  Returns true if successful, false otherwise.
  */
 static bool
-SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn, TransactionId nextXid)
 {
     SnapBuildOnDisk ondisk;
     int            fd;
@@ -1701,7 +1758,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
     pg_crc32c    checksum;
 
     /* no point in loading a snapshot if we're already there */
-    if (builder->state == SNAPBUILD_CONSISTENT)
+    if (builder->state >= SNAPBUILD_FULL_SNAPSHOT)
         return false;
 
     sprintf(path, "pg_logical/snapshots/%X-%X.snap",
@@ -1884,11 +1941,47 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
     if (TransactionIdPrecedes(ondisk.builder.xmin, builder->initial_xmin_horizon))
         goto snapshot_not_interesting;
 
+    /*
+     * Don't use snapshot if external code doesn't know where it is safe to
+     * start streaming and we have no idea which xids to wait for.
+     */
+    if (XLogRecPtrIsInvalid(builder->start_decoding_at) &&
+        !TransactionIdIsValid(nextXid))
+        goto snapshot_not_interesting;
 
     /* ok, we think the snapshot is sensible, copy over everything important */
     builder->xmin = ondisk.builder.xmin;
     builder->xmax = ondisk.builder.xmax;
-    builder->state = ondisk.builder.state;
+    Assert(ondisk.builder.state >= SNAPBUILD_FULL_SNAPSHOT);
+
+    if (XLogRecPtrIsInvalid(builder->start_decoding_at))
+    {
+        /*
+         * Snapshot is fine, now we need to wait till we see all further
+         * commits since the xact's first record.
+         */
+        builder->state = SNAPBUILD_FULL_SNAPSHOT;
+        SnapBuildStartNextPhaseAt(builder, nextXid);
+    }
+    else
+    {
+        /*
+         * If external code (c.f. AllocateSnapshotBuilder) knows we would pick
+         * up all xacts in full before start_decoding_at, just go directly
+         * into CONSISTENT. Though we probably can't stream right now (as we
+         * haven't seen beginnings of some xacts), no xact will be streamed
+         * before start_decoding_at, and we can't be sure to switch into
+         * CONSISTENT later in time anyway.
+         */
+        builder->state = SNAPBUILD_CONSISTENT;
+        SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
+
+        ereport(LOG,
+                (errmsg("logical decoding found consistent point at %X/%X",
+                        (uint32) (lsn >> 32), (uint32) lsn),
+                 errdetail("Logical decoding will begin using saved snapshot.")));
+    }
+
 
     builder->committed.xcnt = ondisk.builder.committed.xcnt;
     /* We only allocated/stored xcnt, not xcnt_space xids ! */
@@ -1911,12 +2004,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
     ReorderBufferSetRestartPoint(builder->reorder, lsn);
 
-    Assert(builder->state == SNAPBUILD_CONSISTENT);
-
-    ereport(LOG,
-            (errmsg("logical decoding found consistent point at %X/%X",
-                    (uint32) (lsn >> 32), (uint32) lsn),
-             errdetail("Logical decoding will begin using saved snapshot.")));
     return true;
 
 snapshot_not_interesting:
-- 
2.11.0

From ab798a087d99ea8d7d0df4c296fee5787cf5394e Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Wed, 23 Oct 2019 15:56:46 +0300
Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact
 in decoding.

Manifested as
ERROR:  subtransaction logged without previous top-level txn record
, this check forbids legit behaviours like
 - First xl_xact_assignment record is beyond reading, i.e. earlier
   restart_lsn.
 - After restart_lsn there is some change of a subxact.
 - After that, there is second xl_xact_assignment (for another subxact)
   revealing relationship between top and first subxact.

Such transaction won't be streamed anyway because we hadn't seen it in full;
confirmed_flush_lsn must be past all these records. Saying for sure whether xact
of some record encountered after snapshot was deserialized can be streamed or
not requires to know nextXid (first not yet running xid) as of this snapshot's
lsn -- all xids < nextXid possibly hadn't been seen in full and are not subject
to decoding. However, generally we don't know nextXid which is taken from
xl_running_xacts; in particular snapshot can be serizalized/deserialized at
XLOG_END_OF_RECOVERY. Changing that for the sake of the check in question is not
worthwhile, so just throw it out instead of trying to relax.
---
 src/backend/replication/logical/reorderbuffer.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8ce28ad629..6faba6077e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
     txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
     subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
-    if (new_top && !new_sub)
-        elog(ERROR, "subtransaction logged without previous top-level txn record");
-
     if (!new_sub)
     {
         if (subtxn->is_known_as_subxact)
-- 
2.11.0


Re: ERROR: subtransaction logged without previous top-level txnrecord

From
Andres Freund
Date:
Hi,

On 2019-10-24 12:59:30 +0300, Arseny Sher wrote:
> Our customer also encountered this issue and I've looked into it. The problem is
> reproduced well enough using the instructions in the previous message.

Is this with

commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   2019-09-13 16:36:28 -0300

    logical decoding: process ASSIGNMENT during snapshot build
    
    Most WAL records are ignored in early SnapBuild snapshot build phases.
    But it's critical to process some of them, so that later messages have
    the correct transaction state after the snapshot is completely built; in
    particular, XLOG_XACT_ASSIGNMENT messages are critical in order for
    sub-transactions to be correctly assigned to their parent transactions,
    or at least one assert misbehaves, as reported by Ildar Musin.
    
    Diagnosed-by: Masahiko Sawada
    Author: Masahiko Sawada
    Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com

applied?

> The check leading to this ERROR is too strict, it forbids legit behaviours. Say
> we have in WAL
> 
> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ]
> 
> - First xl_xact_assignment record is beyond reading, i.e. earlier
>   restart_lsn, where ready snapshot will be taken from disk.
> - After restart_lsn there is some change of a subxact.
> - After that, there is second xl_xact_assignment (for another subxact)
>   revealing relationship between top and first subxact, where this ERROR fires.
> 
> Such transaction won't be streamed because we hadn't seen it in full. It must be
> finished before streaming will start, i.e. before confirmed_flush_lsn.
> 
> Of course, the easiest fix is to just throw the check out.

I don't think that'd actually be a fix, and just hiding a bug.


> Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the
> snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
> just complete snapshot (snapshot by itself in FULL state is just good as in
> CONSISTENT), but also reorderbuffer filled with all currently running
> xacts. This is painless for decoding sessions with existing slots because they
> won't stream anything before confirmed_flush_lsn is reached anyway, at which
> point all transactions which hadn't got into reorderbuffer would definitely
> finish. However, new slots might be created too early, thus losing (not
> decoding) parts of transactions committed after freshly created
> confirmed_flush_lsn. This can happen under the following extremely unlucky
> circumstances:
>   - New slot creation reserves point in WAL since which it would read it
>     (GetXLogInsertRecPtr);
>   - It logs xl_running_xacts to start assembling a snapshot;
>   - Running decoding session with another slot quickly reads this
>     xl_running_xacts and serializes its snapshot;
>   - New slot reads xl_running_xacts and picks this snapshot up, saying that it
>     is ready to stream henceforth, though its reorderbuffer is empty.

Yea, that's a problem :(


> Exact reproducing steps:
> 
> -- session 1
> create table t (i int);
> select pg_create_logical_replication_slot('slot_1', 'test_decoding');
> 
> -- session 2
> begin;
> insert into t values (1);
> 
> -- session 3, start slot creation
> select pg_create_logical_replication_slot('slot_2', 'test_decoding');
> -- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);
> 
> -- session 1
> -- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
> -- need to sleep; our snap will be immediately serialized there
> SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
> 
> -- continue slot_2 creation
> 
> -- session 2: insert some more and commit
> insert into t values (1);
> commit;
> 
> -- now this would find second insert, but not the first one
> SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');

It'd be good to make this into an isolation test.


> What we can do here?

I think the easiest fix might actually be to just ignore serialized
snapshots when building the initial snapshot.


Greetings,

Andres Freund



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2019-10-24 12:59:30 +0300, Arseny Sher wrote:
>> Our customer also encountered this issue and I've looked into it. The problem is
>> reproduced well enough using the instructions in the previous message.
>
> Is this with
>
> commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
> Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Date:   2019-09-13 16:36:28 -0300
>
>     logical decoding: process ASSIGNMENT during snapshot build
>
>     Most WAL records are ignored in early SnapBuild snapshot build phases.
>     But it's critical to process some of them, so that later messages have
>     the correct transaction state after the snapshot is completely built; in
>     particular, XLOG_XACT_ASSIGNMENT messages are critical in order for
>     sub-transactions to be correctly assigned to their parent transactions,
>     or at least one assert misbehaves, as reported by Ildar Musin.
>
>     Diagnosed-by: Masahiko Sawada
>     Author: Masahiko Sawada
>     Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com
>
> applied?

Yeah, I tried fresh master. See below: skipped xl_xact_assignment is
beyond restart_lsn at all (and thus not read), so this doesn't matter.


>> The check leading to this ERROR is too strict, it forbids legit behaviours. Say
>> we have in WAL
>>
>> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ]
>>
>> - First xl_xact_assignment record is beyond reading, i.e. earlier
>>   restart_lsn, where ready snapshot will be taken from disk.
>> - After restart_lsn there is some change of a subxact.
>> - After that, there is second xl_xact_assignment (for another subxact)
>>   revealing relationship between top and first subxact, where this ERROR fires.
>>
>> Such transaction won't be streamed because we hadn't seen it in full. It must be
>> finished before streaming will start, i.e. before confirmed_flush_lsn.
>>
>> Of course, the easiest fix is to just throw the check out.
>
> I don't think that'd actually be a fix, and just hiding a bug.

I don't see a bug here. At least in reproduced scenario I see false
alert, as explained above: transaction with skipped xl_xact_assignment
won't be streamed as it finishes before confirmed_flush_lsn. And I am
pretty sure people encountered in the field the same issue.

In the end of my mail I have offered a way to relax this check instead
of removing it to avoid false triggers: serialize/deserialize a snapshot
only at xl_running_xacts to know nextXid, add function to snapbuilder to
check whether xact can be streamed or not by looking at its xid, etc,
somehow deal with existing serialized snaps which may be logged at
END_OF_RECOVERY without nextXid. I don't believe this check is worth
these complexities. If you think it does, I can do that though.

>
>> Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the
>> snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
>> just complete snapshot (snapshot by itself in FULL state is just good as in
>> CONSISTENT), but also reorderbuffer filled with all currently running
>> xacts. This is painless for decoding sessions with existing slots because they
>> won't stream anything before confirmed_flush_lsn is reached anyway, at which
>> point all transactions which hadn't got into reorderbuffer would definitely
>> finish. However, new slots might be created too early, thus losing (not
>> decoding) parts of transactions committed after freshly created
>> confirmed_flush_lsn. This can happen under the following extremely unlucky
>> circumstances:
>>   - New slot creation reserves point in WAL since which it would read it
>>     (GetXLogInsertRecPtr);
>>   - It logs xl_running_xacts to start assembling a snapshot;
>>   - Running decoding session with another slot quickly reads this
>>     xl_running_xacts and serializes its snapshot;
>>   - New slot reads xl_running_xacts and picks this snapshot up, saying that it
>>     is ready to stream henceforth, though its reorderbuffer is empty.
>
> Yea, that's a problem :(
>
>
>> Exact reproducing steps:
>>
>> -- session 1
>> create table t (i int);
>> select pg_create_logical_replication_slot('slot_1', 'test_decoding');
>>
>> -- session 2
>> begin;
>> insert into t values (1);
>>
>> -- session 3, start slot creation
>> select pg_create_logical_replication_slot('slot_2', 'test_decoding');
>> -- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);
>>
>> -- session 1
>> -- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
>> -- need to sleep; our snap will be immediately serialized there
>> SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
>>
>> -- continue slot_2 creation
>>
>> -- session 2: insert some more and commit
>> insert into t values (1);
>> commit;
>>
>> -- now this would find second insert, but not the first one
>> SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
>
> It'd be good to make this into an isolation test.

Yeah, but to get real chance of firing this requires kinda sleep/break
in the middle of pg_create_logical_replication_slot execution, so I have
no idea how to do that =(

>
>> What we can do here?
>
> I think the easiest fix might actually be to just ignore serialized
> snapshots when building the initial snapshot.

That's an option. However, that anyway requires the distinction between
new slot creation and streaming from existing slot at snapbuilder level,
which currently doesn't exist and which constitutes most part of my
first patch. If we add that, changing between using and not using
serialized snapshots in new slot creation is easy (my patch uses it), I
think this is not principal.


--
Arseny Sher



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Andres, Álvaro, could you please have a look at this?

--
Arseny Sher



Re: ERROR: subtransaction logged without previous top-level txn record

From
Dan Katz
Date:
Is the resolution of the issue in this thread being tracked elsewhere, either in a commit fest or other stream of work?

On Tue, Dec 17, 2019 at 9:47 AM Arseny Sher <a.sher@postgrespro.ru> wrote:
Andres, Álvaro, could you please have a look at this?

--
Arseny Sher




Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Dan Katz <dkatz@joor.com> writes:

> Is the resolution of the issue in this thread being tracked elsewhere,
> either in a commit fest or other stream of work?

Ok, I've created a cf entry:
https://commitfest.postgresql.org/26/2383/

I believe it is the most important to commit at least

0002-Stop-demanding-that-top-xact-must-be-seen-before-sub.patch

from my mail above -- as we see, this issue creates problems in the
field. Moreover, the patch is trivial and hopefully I've shown that
ERROR triggers were spurious and there is no easy way to relax the
check.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: ERROR: subtransaction logged without previous top-level txn record

From
Dan Katz
Date:
Arseny,

I was hoping you could give me some insights about how this bug might appear with multiple replications slots. For example if I have two replication slots would you expect both slots to see the same error, even if they were started, consumed or the LSN was confirmed-flushed at different times?

Dan

On Tue, Dec 17, 2019 at 10:15 AM Arseny Sher <a.sher@postgrespro.ru> wrote:

Dan Katz <dkatz@joor.com> writes:

> Is the resolution of the issue in this thread being tracked elsewhere,
> either in a commit fest or other stream of work?

Ok, I've created a cf entry:
https://commitfest.postgresql.org/26/2383/

I believe it is the most important to commit at least

0002-Stop-demanding-that-top-xact-must-be-seen-before-sub.patch

from my mail above -- as we see, this issue creates problems in the
field. Moreover, the patch is trivial and hopefully I've shown that
ERROR triggers were spurious and there is no easy way to relax the
check.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Hi,

Dan Katz <dkatz@joor.com> writes:

> Arseny,
>
> I was hoping you could give me some insights about how this bug might
> appear with multiple replications slots. For example if I have two
> replication slots would you expect both slots to see the same error, even
> if they were started, consumed or the LSN was confirmed-flushed at
> different times?

Well, to encounter this you must happen to interrupt decoding session
(e.g. shutdown server) when restart_lsn (LSN since WAL will be read next
time) is at unfortunate position, as described in
https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad

Generally each slot has its own restart_lsn, so if one decoding session
stucked on this issue, another one won't necessarily fail at the same
time. However, restart_lsn can be advanced only to certain points,
mainly xl_running_xacts records, which is logged every 15 seconds. So if
all consumers acknowledge changes fast enough, it is quite likely that
during shutdown restart_lsn will be the same for all slots -- which
means either all of them will stuck on further decoding or all of them
won't. If not, different slots might have different restart_lsn and
probably won't fail at the same time; but encountering this issue even
once suggests that your workload makes possibility of such problematic
restart_lsn perceptible (i.e. many subtransactions). And each
restart_lsn probably has approximately the same chance to be 'bad'
(provided the workload is even).


We need a committer familiar with this code to look here...


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Fri, Oct 25, 2019 at 12:26 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> Andres Freund <andres@anarazel.de> writes:
>
> > Hi,
> >
> > On 2019-10-24 12:59:30 +0300, Arseny Sher wrote:
> >> Our customer also encountered this issue and I've looked into it. The problem is
> >> reproduced well enough using the instructions in the previous message.
> >
> > Is this with
> >
> > commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
> > Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> > Date:   2019-09-13 16:36:28 -0300
> >
> >     logical decoding: process ASSIGNMENT during snapshot build
> >
> >     Most WAL records are ignored in early SnapBuild snapshot build phases.
> >     But it's critical to process some of them, so that later messages have
> >     the correct transaction state after the snapshot is completely built; in
> >     particular, XLOG_XACT_ASSIGNMENT messages are critical in order for
> >     sub-transactions to be correctly assigned to their parent transactions,
> >     or at least one assert misbehaves, as reported by Ildar Musin.
> >
> >     Diagnosed-by: Masahiko Sawada
> >     Author: Masahiko Sawada
> >     Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com
> >
> > applied?
>
> Yeah, I tried fresh master. See below: skipped xl_xact_assignment is
> beyond restart_lsn at all (and thus not read), so this doesn't matter.
>
>
> >> The check leading to this ERROR is too strict, it forbids legit behaviours. Say
> >> we have in WAL
> >>
> >> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ]
> >>
> >> - First xl_xact_assignment record is beyond reading, i.e. earlier
> >>   restart_lsn, where ready snapshot will be taken from disk.
> >> - After restart_lsn there is some change of a subxact.
> >> - After that, there is second xl_xact_assignment (for another subxact)
> >>   revealing relationship between top and first subxact, where this ERROR fires.
> >>
> >> Such transaction won't be streamed because we hadn't seen it in full. It must be
> >> finished before streaming will start, i.e. before confirmed_flush_lsn.
> >>
> >> Of course, the easiest fix is to just throw the check out.
> >
> > I don't think that'd actually be a fix, and just hiding a bug.
>
> I don't see a bug here. At least in reproduced scenario I see false
> alert, as explained above: transaction with skipped xl_xact_assignment
> won't be streamed as it finishes before confirmed_flush_lsn.
>

Does this guarantee come from the fact that we need to wait for such a
transaction before reaching a consistent snapshot state?  If not, can
you explain a bit more what makes you say so?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

>> I don't see a bug here. At least in reproduced scenario I see false
>> alert, as explained above: transaction with skipped xl_xact_assignment
>> won't be streamed as it finishes before confirmed_flush_lsn.
>>
>
> Does this guarantee come from the fact that we need to wait for such a
> transaction before reaching a consistent snapshot state?  If not, can
> you explain a bit more what makes you say so?

Right, see FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition -- it exists
exactly for this purpose: once we have good snapshot, we need to wait
for all running xacts to finish to see all xacts we are promising to
stream in full. This ensures <restart_lsn, confirmed_flush_lsn> pair is
good (reading WAL since the former is enough to see all xacts committing
after the latter in full) initially, and slot advancement arrangements
ensure it stays good forever (see
LogicalIncreaseRestartDecodingForSlot).

Well, almost. This is true as long initial snapshot construction process
goes the long way of building the snapshot by itself. If it happens to
pick up from disk ready snapshot pickled there by another decoding
session, it fast path'es to SNAPBUILD_CONSISTENT, which is technically a
bug as described in
https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad

In theory, this bug could indeed lead to 'subtransaction logged without
previous top-level txn record' error. In practice, I think its
possibility is disappearingly small -- process of slot creation must be
intervened in a very short gap by another decoder who serializes its
snapshot (see the exact sequence of steps in the mail above). What is
much more probable (doesn't involve new slot creation and relatively
easily reproducible without sleeps) is false alert triggered by unlucky
position of restart_lsn.


Surely we still must fix it. I just mean
  - People definitely encountered false alert, not this bug
    (at least because nobody said this was immediately after slot
    creation).
  - I've no bright ideas how to relax the check to make it proper
    without additional complications and I'm pretty sure this is
    impossible (again, see above for details), so I'd remove it.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Mon, Feb 3, 2020 at 2:50 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> >> I don't see a bug here. At least in reproduced scenario I see false
> >> alert, as explained above: transaction with skipped xl_xact_assignment
> >> won't be streamed as it finishes before confirmed_flush_lsn.
> >>
> >
> > Does this guarantee come from the fact that we need to wait for such a
> > transaction before reaching a consistent snapshot state?  If not, can
> > you explain a bit more what makes you say so?
>
> Right, see FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition -- it exists
> exactly for this purpose: once we have good snapshot, we need to wait
> for all running xacts to finish to see all xacts we are promising to
> stream in full.
>

So, doesn't this mean that it started occurring after the fix done in
commit 96b5033e11 [1]?  Because before that fix we wouldn't have
allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
SNAPBUILD_FULL_SNAPSHOT state.  I am not telling the fix in that
commit is wrong, but just trying to understand the situation here.

>
> Well, almost. This is true as long initial snapshot construction process
> goes the long way of building the snapshot by itself. If it happens to
> pick up from disk ready snapshot pickled there by another decoding
> session, it fast path'es to SNAPBUILD_CONSISTENT, which is technically a
> bug as described in
> https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad
>

Can't we deal with this separately?  If so, I think let's not mix the
discussions for both as the root cause of both seems different.


[1] -
 commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
 Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   2019-09-13 16:36:28 -0300

     logical decoding: process ASSIGNMENT during snapshot build



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

> So, doesn't this mean that it started occurring after the fix done in
> commit 96b5033e11 [1]?  Because before that fix we wouldn't have
> allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
> SNAPBUILD_FULL_SNAPSHOT state.  I am not telling the fix in that
> commit is wrong, but just trying to understand the situation here.

Nope. Consider again example of WAL above triggering the error:

[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]

Decoder starting reading WAL at <restart_lsn> where he immediately reads
from disk snapshot serialized earlier, which makes it jump to
SNAPBUILD_CONSISTENT right away. It doesn't read xl_xact_assignment_1,
but it reads xl_xact_assignment_2 already in SNAPBUILD_CONSISTENT state,
so catches the error regardless of this commit.

>> Well, almost. This is true as long initial snapshot construction process
>> goes the long way of building the snapshot by itself. If it happens to
>> pick up from disk ready snapshot pickled there by another decoding
>> session, it fast path'es to SNAPBUILD_CONSISTENT, which is technically a
>> bug as described in
>> https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad
>>
>
> Can't we deal with this separately?  If so, I think let's not mix the
> discussions for both as the root cause of both seems different.

These issues are related: before removing the check it would be nice to
ensure that there is no bugs it might protect us from (and it turns out
there actually is, though it won't always protect, and though this bug
has very small probability). Moreover, they are about more or less
subject -- avoiding partially decoded xacts -- and once you dived deep
enough to deal with one, it is reasonable to deal with another instead
of doing that twice. But as a practical matter, removing the check is
simple one-liner, and its presence causes people troubles -- so I'd
suggest doing that first and then deal with the rest. I don't think
starting new thread is worthwhile here, but if you think it does, I can
create it.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> > So, doesn't this mean that it started occurring after the fix done in
> > commit 96b5033e11 [1]?  Because before that fix we wouldn't have
> > allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
> > SNAPBUILD_FULL_SNAPSHOT state.  I am not telling the fix in that
> > commit is wrong, but just trying to understand the situation here.
>
> Nope. Consider again example of WAL above triggering the error:
>
> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
>
> Decoder starting reading WAL at <restart_lsn> where he immediately reads
> from disk snapshot serialized earlier, which makes it jump to
> SNAPBUILD_CONSISTENT right away.
>

Sure, I understand that if we get serialized snapshot from disk, this
problem can occur and probably we can fix by ignoring serialized
snapshots during slot creation or something on those lines.  However,
what I am trying to understand is whether this can occur from another
path as well.  I think it might occur without using serialized
snapshots as well because we allow decoding xl_xact_assignment record
even when the snapshot state is not full.  Say in your above example,
even if the snapshot state is not SNAPBUILD_CONSISTENT as we haven't
used the serialized snapshot, then also, it can lead to the above
problem due to decoding of xl_xact_assignment.  I have not tried to
generate a test case for this, so I could easily be wrong here.

What I am trying to get at is if the problem can only occur by using
serialized snapshots and we fix it by somehow not using them initial
slot creation, then ideally we don't need to remove the error in
ReorderBufferAssignChild, but if that is not the case, then we need to
discuss other cases as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

> On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>> Amit Kapila <amit.kapila16@gmail.com> writes:
>>
>> > So, doesn't this mean that it started occurring after the fix done in
>> > commit 96b5033e11 [1]?  Because before that fix we wouldn't have
>> > allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
>> > SNAPBUILD_FULL_SNAPSHOT state.  I am not telling the fix in that
>> > commit is wrong, but just trying to understand the situation here.
>>
>> Nope. Consider again example of WAL above triggering the error:
>>
>> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
>>
>> Decoder starting reading WAL at <restart_lsn> where he immediately reads
>> from disk snapshot serialized earlier, which makes it jump to
>> SNAPBUILD_CONSISTENT right away.
>>
>
> Sure, I understand that if we get serialized snapshot from disk, this
> problem can occur and probably we can fix by ignoring serialized
> snapshots during slot creation or something on those lines.

There is some confusion. I'll try to reword what we have here:

1) Decoding from existing slot (*not* initial snapshot construction)
starts up, immediately picks up snapshot at restart_lsn (getting into
SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it
hadn't seen in full (no toplevel records) transaction which it is not
even going to stream -- but still dies with "subtransation logged
without...". That's my example above, and that's what people are
complaining about. Here, usage of serialized snapshot and jump to
SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be
able to stream data since confirmed_flush_lsn, we must pick it up as we
might not be able to assemble it from scratch in time. I mean, what is
wrong here is not serialized snapshot usage but the check.

(Lengthy comment to AllocateSnapshotBuilder in my
0001-Fix-serialized-snapshot-usage-for-new-logical-slots.patch explains
why snapbuilder is not able to do FULL -> CONSISTENT transition on its
own early enough for decoding from existing slot, so the jump on
snapshot pickup is performed to CONSISTENT directly.)

This happens with or without bac2fae05c.

2) *New* slot creationg process picks up serialized snapshot and jumps
to CONSISTENT without waiting for all running xacts to finish. This is
wrong and is a bug (of very low probability), as we risk promising to
decode xacts which we might not seen in full. Sometimes it could be
arrested by "subtransation logged without..." check, but not necessarily
-- e.g. there could be no subxacts at all.


> However,
> what I am trying to understand is whether this can occur from another
> path as well.  I think it might occur without using serialized
> snapshots as well because we allow decoding xl_xact_assignment record
> even when the snapshot state is not full.  Say in your above example,
> even if the snapshot state is not SNAPBUILD_CONSISTENT as we haven't
> used the serialized snapshot, then also, it can lead to the above
> problem due to decoding of xl_xact_assignment.  I have not tried to
> generate a test case for this, so I could easily be wrong here.

What you are suggesting here is 3), which is, well, sort of form of 1),
meaning "subxact logged..." error also pointlessly triggered, but for
new slot creation. With bac2fae0, decoder might miss first
xl_xact_assignment because it is beyond start of reading WAL but
encounter second xl_xact_assignment and die on it due to this check
before even getting FULL.

But now that I'm thinking about it, I suspect that similar could happen
even before bac2fae0. Imagine

<start_of_reading_wal> <xl_xact_assignment_1> <SNAPBUILD_FULL> <subxact_change> <xl_xact_assignment_2> <commit> ...
<SNAPBUILD_CONSISTENT>

Before bac2fae0, xl_xact_assignment_1 was ignored, so
xl_xact_assignment_1 would trigger the error.

> What I am trying to get at is if the problem can only occur by using
> serialized snapshots and we fix it by somehow not using them initial
> slot creation, then ideally we don't need to remove the error in
> ReorderBufferAssignChild, but if that is not the case, then we need to
> discuss other cases as well.

So, 1) and 3) mean this is not the case.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Wed, Feb 5, 2020 at 2:34 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> > On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> >> Amit Kapila <amit.kapila16@gmail.com> writes:
> >>
> >> > So, doesn't this mean that it started occurring after the fix done in
> >> > commit 96b5033e11 [1]?  Because before that fix we wouldn't have
> >> > allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
> >> > SNAPBUILD_FULL_SNAPSHOT state.  I am not telling the fix in that
> >> > commit is wrong, but just trying to understand the situation here.
> >>
> >> Nope. Consider again example of WAL above triggering the error:
> >>
> >> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
> >>
> >> Decoder starting reading WAL at <restart_lsn> where he immediately reads
> >> from disk snapshot serialized earlier, which makes it jump to
> >> SNAPBUILD_CONSISTENT right away.
> >>
> >
> > Sure, I understand that if we get serialized snapshot from disk, this
> > problem can occur and probably we can fix by ignoring serialized
> > snapshots during slot creation or something on those lines.
>
> There is some confusion. I'll try to reword what we have here:
>
> 1) Decoding from existing slot (*not* initial snapshot construction)
> starts up, immediately picks up snapshot at restart_lsn (getting into
> SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it
> hadn't seen in full (no toplevel records) transaction which it is not
> even going to stream -- but still dies with "subtransation logged
> without...". That's my example above, and that's what people are
> complaining about. Here, usage of serialized snapshot and jump to
> SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be
> able to stream data since confirmed_flush_lsn, we must pick it up as we
> might not be able to assemble it from scratch in time. I mean, what is
> wrong here is not serialized snapshot usage but the check.
>

I was thinking if we have some way to skip processing of
xl_xact_assignment for such cases, then it might be better.  Say,
along with restart_lsn, if have some way to find corresponding nextXid
(below which we don't need to process records).  Say, if, during
SnapBuildProcessRunningXacts, we record the xid of txn we got by
ReorderBufferGetOldestTXN in slot, then can't we use it to skip such
records.

> (Lengthy comment to AllocateSnapshotBuilder in my
> 0001-Fix-serialized-snapshot-usage-for-new-logical-slots.patch explains
> why snapbuilder is not able to do FULL -> CONSISTENT transition on its
> own early enough for decoding from existing slot, so the jump on
> snapshot pickup is performed to CONSISTENT directly.)
>
> This happens with or without bac2fae05c.
>
> 2) *New* slot creationg process picks up serialized snapshot and jumps
> to CONSISTENT without waiting for all running xacts to finish. This is
> wrong and is a bug (of very low probability), as we risk promising to
> decode xacts which we might not seen in full. Sometimes it could be
> arrested by "subtransation logged without..." check, but not necessarily
> -- e.g. there could be no subxacts at all.
>
>
> > However,
> > what I am trying to understand is whether this can occur from another
> > path as well.  I think it might occur without using serialized
> > snapshots as well because we allow decoding xl_xact_assignment record
> > even when the snapshot state is not full.  Say in your above example,
> > even if the snapshot state is not SNAPBUILD_CONSISTENT as we haven't
> > used the serialized snapshot, then also, it can lead to the above
> > problem due to decoding of xl_xact_assignment.  I have not tried to
> > generate a test case for this, so I could easily be wrong here.
>
> What you are suggesting here is 3), which is, well, sort of form of 1),
> meaning "subxact logged..." error also pointlessly triggered, but for
> new slot creation. With bac2fae0, decoder might miss first
> xl_xact_assignment because it is beyond start of reading WAL but
> encounter second xl_xact_assignment and die on it due to this check
> before even getting FULL.
>
> But now that I'm thinking about it, I suspect that similar could happen
> even before bac2fae0. Imagine
>
> <start_of_reading_wal> <xl_xact_assignment_1> <SNAPBUILD_FULL> <subxact_change> <xl_xact_assignment_2> <commit> ...
<SNAPBUILD_CONSISTENT>
>
> Before bac2fae0, xl_xact_assignment_1 was ignored, so
> xl_xact_assignment_1 would trigger the error.
>

'xl_xact_assignment_1 would trigger the error', I think in this part
of sentence you mean to say xl_xact_assignment_2 because we won't try
to decode xl_xact_assignment_1 before bac2fae0.  If so, won't we wait
for such a transaction to finish while changing the snapshot state
from SNAPBUILD_BUILDING_SNAPSHOT to SNAPBUILD_FULL_SNAPSHOT?  And if
the transaction is finished, ideally, we should not try to decode its
WAL and or create its ReorderBufferTxn.

> > What I am trying to get at is if the problem can only occur by using
> > serialized snapshots and we fix it by somehow not using them initial
> > slot creation, then ideally we don't need to remove the error in
> > ReorderBufferAssignChild, but if that is not the case, then we need to
> > discuss other cases as well.
>
> So, 1) and 3) mean this is not the case.
>

Right, I am thinking that if we can find some way to skip the xact
assignment for (1) and (3), then that might be a better fix.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Fri, Feb 7, 2020 at 4:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 5, 2020 at 2:34 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> >
>
> > > What I am trying to get at is if the problem can only occur by using
> > > serialized snapshots and we fix it by somehow not using them initial
> > > slot creation, then ideally we don't need to remove the error in
> > > ReorderBufferAssignChild, but if that is not the case, then we need to
> > > discuss other cases as well.
> >
> > So, 1) and 3) mean this is not the case.
> >
>
> Right, I am thinking that if we can find some way to skip the xact
> assignment for (1) and (3), then that might be a better fix.
>

Just to be clear, I am just brainstorming the ideas to see if we can
find some better solutions to the problems (1) and (3) described by
Arseny in the above email [1].  At this stage, it is not clear to me
that we have a fix simple enough to backpatch apart from what Arseny
posted in his fist email [2] (which is to stop demanding that top xact
must be seen before subxact in decoding.).

[1] - https://www.postgresql.org/message-id/87zhdx76d5.fsf%40ars-thinkpad
[2] - https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

>> 1) Decoding from existing slot (*not* initial snapshot construction)
>> starts up, immediately picks up snapshot at restart_lsn (getting into
>> SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it
>> hadn't seen in full (no toplevel records) transaction which it is not
>> even going to stream -- but still dies with "subtransation logged
>> without...". That's my example above, and that's what people are
>> complaining about. Here, usage of serialized snapshot and jump to
>> SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be
>> able to stream data since confirmed_flush_lsn, we must pick it up as we
>> might not be able to assemble it from scratch in time. I mean, what is
>> wrong here is not serialized snapshot usage but the check.
>>
>
> I was thinking if we have some way to skip processing of
> xl_xact_assignment for such cases, then it might be better.  Say,
> along with restart_lsn, if have some way to find corresponding nextXid
> (below which we don't need to process records).

I don't believe you can that without persisting additional
data. Basically, what we need is list of transactions who are running at
the point of snapshot serialization *and* already wrote something before
it -- those we hadn't seen in full and can't decode. We have no such
data currently. The closest thing we have is xl_running_xacts->nextXid,
but

  1) issued xid doesn't necessarily means xact actually wrote something,
     so we can't just skip xl_xact_assignment for xid < nextXid, it might
     still be decoded
  2) snapshot might be serialized not at xl_running_xacts anyway

Surely this thing doesn't deserve changing persisted data format.


Somehow I hadn't realized this earlier, so my comments/commit messages
in patches above were not accurate here; I've edited them. Also in the
first patch serialized snapshots are not no longer used for new slot
creation at all, as Andres suggested above. This is not principal, as I
said, but arguably makes things simpler a bit.

I've also found a couple of issues with slot copying feature, will post
in separate thread on them.


From 8952bc955585d12a06cdd6d339bcbabafc772768 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Tue, 22 Oct 2019 19:02:14 +0300
Subject: [PATCH 1/2] Don't use serialized snapshots during logical slot
 creation.

snapbuild.c enters SNAPBUILD_CONSISTENT immediately after deserializing the
snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
just complete snapshot, but also reorderbuffer filled with all currently running
xacts. This is fine for decoding sessions with existing slots though as they
won't stream anything before confirmed_flush_lsn is reached anyway, at which
point all transactions which hadn't got into reorderbuffer would definitely
finish. However, new slots might be created too early, thus losing (not
decoding) parts of transactions committed after freshly created
confirmed_flush_lsn. This can happen under the following extremely unlucky
circumstances:
 - New slot creation reserves point in WAL since which it would read it
   (GetXLogInsertRecPtr);
 - It logs xl_running_xacts to start assembling a snapshot;
 - Running decoding session with another slot quickly reads this
   xl_running_xacts and serializes its snapshot;
 - New slot reads xl_running_xacts and picks this snapshot up, saying that it
   is ready to stream henceforth, though its reorderbuffer is empty.

Fix by forbidding to use serialized snapshots during slot creation. Actually we
could use snapshot if it was picked at xl_running_xacts where we could perform
usual FULL -> CONSISTENT transition conservatively via nextXid, but that seems
like a complication for a too small benefit.
---
 src/backend/replication/logical/logical.c   |  2 +-
 src/backend/replication/logical/snapbuild.c | 85 +++++++++++++++++++++++------
 2 files changed, 70 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index da265f5294..67ed52e56f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
     ReplicationSlotMarkDirty();
     ReplicationSlotSave();
 
-    ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
+    ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
                                  need_full_snapshot, false,
                                  read_page, prepare_write, do_write,
                                  update_progress);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0bd1d0f954..6f503a93aa 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,7 +165,7 @@ struct SnapBuild
 
     /*
      * Don't replay commits from an LSN < this LSN. This can be set externally
-     * but it will also be advanced (never retreat) from within snapbuild.c.
+     * or established by snapbuild.c once consistent snapshot is assembled.
      */
     XLogRecPtr    start_decoding_at;
 
@@ -309,7 +309,16 @@ SnapBuildStartNextPhaseAt(SnapBuild *builder, TransactionId at)
  * Allocate a new snapshot builder.
  *
  * xmin_horizon is the xid >= which we can be sure no catalog rows have been
- * removed, start_lsn is the LSN >= we want to replay commits.
+ * removed.
+ * start_lsn is InvalidXLogRecPtr or LSN >= we want to replay commits.
+ *  InvalidXLogRecPtr is used during slot creation; snapbuild will assemble
+ *  consistent snapshot and (if continue decoding -- no core code does that
+ *  currently) stream all further commits.
+ *  If normal LSN is passed, caller *must* be sure that WAL reading starts
+ *  early enough to get the snapshot and pick up the first change of
+ *  earliest xact to stream. Slot creation and advancement machinery
+ *  guarantees that slot's <restart_lsn, confirmed_flush_lsn> pair always
+ *  satisfies this.
  */
 SnapBuild *
 AllocateSnapshotBuilder(ReorderBuffer *reorder,
@@ -408,7 +417,8 @@ SnapBuildCurrentState(SnapBuild *builder)
 bool
 SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
 {
-    return ptr < builder->start_decoding_at;
+    return XLogRecPtrIsInvalid(builder->start_decoding_at) ||
+        ptr < builder->start_decoding_at;
 }
 
 /*
@@ -945,16 +955,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
          TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder))))
     {
         /* ensure that only commits after this are getting replayed */
-        if (builder->start_decoding_at <= lsn)
-            builder->start_decoding_at = lsn + 1;
+        Assert(XLogRecPtrIsInvalid(builder->start_decoding_at)    ||
+               builder->start_decoding_at > lsn);
         return;
     }
 
     if (builder->state < SNAPBUILD_CONSISTENT)
     {
         /* ensure that only commits after this are getting replayed */
-        if (builder->start_decoding_at <= lsn)
-            builder->start_decoding_at = lsn + 1;
+        Assert(XLogRecPtrIsInvalid(builder->start_decoding_at)    ||
+               builder->start_decoding_at > lsn);
 
         /*
          * If building an exportable snapshot, force xid to be tracked, even
@@ -966,6 +976,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
         }
     }
 
+    if (!XLogRecPtrIsInvalid(builder->start_decoding_at) &&
+        builder->start_decoding_at <= lsn)
+    {
+        /*
+         * We are going to stream this xact, so must already have fine
+         * snapshot.
+         */
+        Assert(builder->state == SNAPBUILD_CONSISTENT);
+    }
+
     for (nxact = 0; nxact < nsubxacts; nxact++)
     {
         TransactionId subxid = subxacts[nxact];
@@ -1250,10 +1270,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
      */
     if (running->oldestRunningXid == running->nextXid)
     {
-        if (builder->start_decoding_at == InvalidXLogRecPtr ||
-            builder->start_decoding_at <= lsn)
+        if (XLogRecPtrIsInvalid(builder->start_decoding_at))
             /* can decode everything after this */
             builder->start_decoding_at = lsn + 1;
+        Assert(builder->start_decoding_at >= lsn);
 
         /* As no transactions were running xmin/xmax can be trivially set. */
         builder->xmin = running->nextXid;    /* < are finished */
@@ -1277,7 +1297,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
     else if (!builder->building_full_snapshot &&
              SnapBuildRestore(builder, lsn))
     {
-        /* there won't be any state to cleanup */
+        /* there won't be much state to cleanup */
         return false;
     }
 
@@ -1358,6 +1378,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
     {
         builder->state = SNAPBUILD_CONSISTENT;
         SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
+        if (XLogRecPtrIsInvalid(builder->start_decoding_at))
+            /* can decode everything after this */
+            builder->start_decoding_at = lsn + 1;
+        Assert(builder->start_decoding_at >= lsn);
 
         ereport(LOG,
                 (errmsg("logical decoding found consistent point at %X/%X",
@@ -1499,8 +1523,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
            builder->last_serialized_snapshot <= lsn);
 
     /*
-     * no point in serializing if we cannot continue to work immediately after
-     * restoring the snapshot
+     * No point in serializing if the snapshot is not complete (< FULL).
+     * FULL snapshot per se is just as good as CONSISTENT; difference between
+     * these states is not a snapshot property, but whether we have filled
+     * reorderbuffer with all currently running xacts. However, there isn't
+     * much sense in serializing it as serialized snaps are used only for
+     * decoding from existing slots, which had already reached CONSISTENT.
      */
     if (builder->state < SNAPBUILD_CONSISTENT)
         return;
@@ -1688,7 +1716,8 @@ out:
 
 /*
  * Restore a snapshot into 'builder' if previously one has been stored at the
- * location indicated by 'lsn'. Returns true if successful, false otherwise.
+ * location indicated by 'lsn'.
+ * Returns true if successful, false otherwise.
  */
 static bool
 SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
@@ -1704,6 +1733,16 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
     if (builder->state == SNAPBUILD_CONSISTENT)
         return false;
 
+    /*
+     * Don't use serialized snapshot if we are not sure where all
+     * currently running xacts will finish (new slot creation).
+     * (Actually, if we came here through xl_running_xacts, we could perform
+     * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
+     * but added lines of code would hardly worth the benefit.)
+     */
+    if (builder->start_decoding_at == InvalidXLogRecPtr)
+        return false;
+
     sprintf(path, "pg_logical/snapshots/%X-%X.snap",
             (uint32) (lsn >> 32), (uint32) lsn);
 
@@ -1888,7 +1927,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
     /* ok, we think the snapshot is sensible, copy over everything important */
     builder->xmin = ondisk.builder.xmin;
     builder->xmax = ondisk.builder.xmax;
-    builder->state = ondisk.builder.state;
 
     builder->committed.xcnt = ondisk.builder.committed.xcnt;
     /* We only allocated/stored xcnt, not xcnt_space xids ! */
@@ -1901,6 +1939,23 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
     }
     ondisk.builder.committed.xip = NULL;
 
+    /*
+     * We probably don't have in reorderbuffer parts of currently running
+     * xacts right now. However, snapbuild can't determine *earliest* spot
+     * where decoding is safe (we'd see all xacts in full) on its own, because
+     * this requires to know which xids already wrote something, and such
+     * writes are beyond our reading of WAL. xl_running_xacts->nextXid
+     * doesn't help us here, because 1) issued xid doesn't necessarily means
+     * xact wrote something, so there would be a risk of arriving at
+     * CONSISTENT a bit later than possible which is ok for new slot creation,
+     * but unacceptable otherwise; 2) snapshot might be serialized at some
+     * other point.
+     *
+     * Slot must have provided as with safe start_decoding_at though, so just
+     * skip FULL -> CONSISTENT transition and set CONSISTENT directly.
+     */
+    builder->state = SNAPBUILD_CONSISTENT;
+
     /* our snapshot is not interesting anymore, build a new one */
     if (builder->snapshot != NULL)
     {
@@ -1911,8 +1966,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
     ReorderBufferSetRestartPoint(builder->reorder, lsn);
 
-    Assert(builder->state == SNAPBUILD_CONSISTENT);
-
     ereport(LOG,
             (errmsg("logical decoding found consistent point at %X/%X",
                     (uint32) (lsn >> 32), (uint32) lsn),
-- 
2.11.0

From 1449c8aacd2feadfa104a1ea4b73030e37cbba14 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Wed, 23 Oct 2019 15:56:46 +0300
Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact
 in decoding.

Manifested as

ERROR:  subtransaction logged without previous top-level txn record

this check forbids legit behaviours like
 - First xl_xact_assignment record is beyond reading, i.e. earlier
   restart_lsn.
 - After restart_lsn there is some change of a subxact.
 - After that, there is second xl_xact_assignment (for another subxact)
   revealing relationship between top and first subxact.

Such transaction won't be streamed anyway because we hadn't seen it in full;
confirmed_flush_lsn must be past all these records. Saying for sure whether xact
of some record encountered after snapshot was deserialized can be streamed or
not requires to know whether it wrote something before deserialization point --
if yes, it hasn't been seen in full and can't be decoded. Snapshot doesn't have
such info, so there is no easy way to relax the check.
---
 src/backend/replication/logical/reorderbuffer.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8ce28ad629..6faba6077e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
     txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
     subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
-    if (new_top && !new_sub)
-        elog(ERROR, "subtransaction logged without previous top-level txn record");
-
     if (!new_sub)
     {
         if (subtxn->is_known_as_subxact)
-- 
2.11.0



--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Sun, Feb 9, 2020 at 9:37 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> >> 1) Decoding from existing slot (*not* initial snapshot construction)
> >> starts up, immediately picks up snapshot at restart_lsn (getting into
> >> SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it
> >> hadn't seen in full (no toplevel records) transaction which it is not
> >> even going to stream -- but still dies with "subtransation logged
> >> without...". That's my example above, and that's what people are
> >> complaining about. Here, usage of serialized snapshot and jump to
> >> SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be
> >> able to stream data since confirmed_flush_lsn, we must pick it up as we
> >> might not be able to assemble it from scratch in time. I mean, what is
> >> wrong here is not serialized snapshot usage but the check.
> >>
> >
> > I was thinking if we have some way to skip processing of
> > xl_xact_assignment for such cases, then it might be better.  Say,
> > along with restart_lsn, if have some way to find corresponding nextXid
> > (below which we don't need to process records).
>
> I don't believe you can that without persisting additional
> data. Basically, what we need is list of transactions who are running at
> the point of snapshot serialization *and* already wrote something before
> it -- those we hadn't seen in full and can't decode. We have no such
> data currently. The closest thing we have is xl_running_xacts->nextXid,
> but
>
>   1) issued xid doesn't necessarily means xact actually wrote something,
>      so we can't just skip xl_xact_assignment for xid < nextXid, it might
>      still be decoded
>   2) snapshot might be serialized not at xl_running_xacts anyway
>
> Surely this thing doesn't deserve changing persisted data format.
>

I agree that it won't be a good idea to change the persisted data
format, especially in back-branches.  I don't see any fix which can
avoid this without doing major changes in the code.  Apart from this,
we have to come up with a solution for point (3) discussed in the
above email [1] which again could be change in design.  I think we can
first try to proceed with the patch
0002-Stop-demanding-that-top-xact-must-be-seen-before--v2 and then we
can discuss the other patch.  I can't see a way to write a test case
for this, can you think of any way?

Andres, anyone else, if you have a better idea other than changing the
code (removing the expected error) as in
0002-Stop-demanding-that-top-xact-must-be-seen-before--v2, then
please, let us know. You can read the points (1) and (3) in the email
above [1] where the below error check will hit for valid cases.  We
have discussed this in detail, but couldn't come up with anything
better than to remove this check.

@@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb,
TransactionId xid,
  txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
  subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);

- if (new_top && !new_sub)
- elog(ERROR, "subtransaction logged without previous top-level txn record");
-

[1] - https://www.postgresql.org/message-id/87zhdx76d5.fsf%40ars-thinkpad

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

>> I don't believe you can that without persisting additional
>> data. Basically, what we need is list of transactions who are running at
>> the point of snapshot serialization *and* already wrote something before
>> it -- those we hadn't seen in full and can't decode. We have no such
>> data currently. The closest thing we have is xl_running_xacts->nextXid,
>> but
>>
>>   1) issued xid doesn't necessarily means xact actually wrote something,
>>      so we can't just skip xl_xact_assignment for xid < nextXid, it might
>>      still be decoded
>>   2) snapshot might be serialized not at xl_running_xacts anyway
>>
>> Surely this thing doesn't deserve changing persisted data format.
>>
>
> I agree that it won't be a good idea to change the persisted data
> format, especially in back-branches.  I don't see any fix which can
> avoid this without doing major changes in the code.  Apart from this,
> we have to come up with a solution for point (3) discussed in the
> above email [1] which again could be change in design.  I think we can
> first try to proceed with the patch
> 0002-Stop-demanding-that-top-xact-must-be-seen-before--v2 and then we
> can discuss the other patch.  I can't see a way to write a test case
> for this, can you think of any way?

Yeah, let's finally get it.

Attached is raw version of isolation test triggering false
'subtransaction logged without...' (case (1)). However, frankly I don't
see much value in it, so I'm dubious whether it should be included in
the patch.

diff --git a/contrib/test_decoding/specs/subxact_logged_without_top.spec
b/contrib/test_decoding/specs/subxact_logged_without_top.spec
new file mode 100644
index 0000000000..55b51357a9
--- /dev/null
+++ b/contrib/test_decoding/specs/subxact_logged_without_top.spec
@@ -0,0 +1,51 @@
+
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write
inxact
 
+    CREATE TABLE harvest(apples integer);
+    CREATE OR REPLACE FUNCTION subxacts() returns void as $$
+    BEGIN
+      FOR i in 1 .. 65 LOOP
+        BEGIN
+          INSERT INTO harvest VALUES (42);
+        EXCEPTION
+        WHEN OTHERS THEN
+      RAISE;
+        END;
+    END LOOP;
+    END; $$LANGUAGE 'plpgsql';
+}
+
+teardown
+{
+    DROP TABLE IF EXISTS harvest;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_begin" { BEGIN; }
+step "s0_first_subxact" {
+    DO LANGUAGE plpgsql $$
+      BEGIN
+        BEGIN
+          INSERT INTO harvest VALUES (41);
+    EXCEPTION WHEN OTHERS THEN RAISE;
+    END;
+      END $$;
+}
+step "s0_many_subxacts" { select subxacts(); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_begin" { BEGIN; }
+step "s1_dml" { INSERT INTO harvest VALUES (43); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+step "s2_checkpoint" { CHECKPOINT; }
+step "s2_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids',
'0','skip-empty-xacts', '1'); }
 
+
+permutation "s0_begin" "s0_first_subxact" "s2_checkpoint" "s1_begin" "s1_dml" "s0_many_subxacts" "s0_commit"
"s2_checkpoint""s2_get_changes" "s1_commit" "s2_get_changes" 


-- cheers, Arseny

Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Mon, Feb 10, 2020 at 6:34 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> >> I don't believe you can that without persisting additional
> >> data. Basically, what we need is list of transactions who are running at
> >> the point of snapshot serialization *and* already wrote something before
> >> it -- those we hadn't seen in full and can't decode. We have no such
> >> data currently. The closest thing we have is xl_running_xacts->nextXid,
> >> but
> >>
> >>   1) issued xid doesn't necessarily means xact actually wrote something,
> >>      so we can't just skip xl_xact_assignment for xid < nextXid, it might
> >>      still be decoded
> >>   2) snapshot might be serialized not at xl_running_xacts anyway
> >>
> >> Surely this thing doesn't deserve changing persisted data format.
> >>
> >
> > I agree that it won't be a good idea to change the persisted data
> > format, especially in back-branches.  I don't see any fix which can
> > avoid this without doing major changes in the code.  Apart from this,
> > we have to come up with a solution for point (3) discussed in the
> > above email [1] which again could be change in design.  I think we can
> > first try to proceed with the patch
> > 0002-Stop-demanding-that-top-xact-must-be-seen-before--v2 and then we
> > can discuss the other patch.  I can't see a way to write a test case
> > for this, can you think of any way?
>
> Yeah, let's finally get it.
>
> Attached is raw version of isolation test triggering false
> 'subtransaction logged without...' (case (1)).
>

This didn't reproduce the desired error for me (tried without a
patch).  I think you need to add two more steps ("s2_checkpoint"
"s2_get_changes") at the end of the test to set the restart_lsn at the
appropriate location.

> However, frankly I don't
> see much value in it, so I'm dubious whether it should be included in
> the patch.
>

I think this will surely test some part of the system which was not
tested before, mainly having some subxacts without top-xact getting
decoded even though we don't need to send such a transaction.  Can you
prepare a complete patch (for
Stop-demanding-that-top-xact-must-be-seen-before-sub) having this test
as part of it?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

>> Attached is raw version of isolation test triggering false
>> 'subtransaction logged without...' (case (1)).
>>
>
> This didn't reproduce the desired error for me (tried without a
> patch).  I think you need to add two more steps ("s2_checkpoint"
> "s2_get_changes") at the end of the test to set the restart_lsn at the
> appropriate location.

That's weird, it reliably fails with expected error for me. There are
already two s2_checkpoint's: first establishes potential (broken)
restart_lsn (serializes snapshot after first xl_xact_assignment of s0
xact, but before first record of s1 xact), the second ensures
s2_get_changes directly following it will actually advance the slot,
making that potential restart_lsn real.

I don't see how adding s2_checkpoint and s2_get_changes helps here. Do
they really provoke error in your setup? Could you check with pg_waldump
what's going on?

>
>> However, frankly I don't
>> see much value in it, so I'm dubious whether it should be included in
>> the patch.
>>
>
> I think this will surely test some part of the system which was not
> tested before, mainly having some subxacts without top-xact getting
> decoded even though we don't need to send such a transaction.  Can you
> prepare a complete patch (for
> Stop-demanding-that-top-xact-must-be-seen-before-sub) having this test
> as part of it?

Ok, will do.


-- cheers, arseny



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
>> I think this will surely test some part of the system which was not
>> tested before, mainly having some subxacts without top-xact getting
>> decoded even though we don't need to send such a transaction.  Can you
>> prepare a complete patch (for
>> Stop-demanding-that-top-xact-must-be-seen-before-sub) having this test
>> as part of it?
>
> Ok, will do.

Here it is.

From 3748a87618dfb7379565884655f4c91b22724595 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Wed, 23 Oct 2019 15:56:46 +0300
Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact
 in decoding.

Manifested as

ERROR:  subtransaction logged without previous top-level txn record

this check forbids legit behaviours like
 - First xl_xact_assignment record is beyond reading, i.e. earlier
   restart_lsn.
 - After restart_lsn there is some change of a subxact.
 - After that, there is second xl_xact_assignment (for another subxact)
   revealing relationship between top and first subxact.

Such transaction won't be streamed anyway because we hadn't seen it in full;
confirmed_flush_lsn must be past all these records. Saying for sure whether xact
of some record encountered after snapshot was deserialized can be streamed or
not requires to know whether it wrote something before deserialization point --
if yes, it hasn't been seen in full and can't be decoded. Snapshot doesn't have
such info, so there is no easy way to relax the check.
---
 contrib/test_decoding/Makefile                     |  2 +-
 .../test_decoding/expected/subxact_without_top.out | 40 +++++++++++++
 .../test_decoding/specs/subxact_without_top.spec   | 65 ++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |  3 -
 4 files changed, 106 insertions(+), 4 deletions(-)
 create mode 100644 contrib/test_decoding/expected/subxact_without_top.out
 create mode 100644 contrib/test_decoding/specs/subxact_without_top.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 4afb1d963e..f439c582a5 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
     decoding_into_rel binary prepared replorigin time messages \
     spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-    oldest_xmin snapshot_transfer
+    oldest_xmin snapshot_transfer subxact_without_top
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/subxact_without_top.out
b/contrib/test_decoding/expected/subxact_without_top.out
new file mode 100644
index 0000000000..357e0fd964
--- /dev/null
+++ b/contrib/test_decoding/expected/subxact_without_top.out
@@ -0,0 +1,40 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s0_begin s0_first_subxact s2_checkpoint s1_begin s1_dml s0_many_subxacts s0_commit s2_checkpoint
s2_get_changes_suppress_outputs1_commit s2_get_changes
 
+step s0_begin: BEGIN;
+step s0_first_subxact: 
+    DO LANGUAGE plpgsql $$
+      BEGIN
+        BEGIN
+          INSERT INTO harvest VALUES (41);
+    EXCEPTION WHEN OTHERS THEN RAISE;
+    END;
+      END $$;
+
+step s2_checkpoint: CHECKPOINT;
+step s1_begin: BEGIN;
+step s1_dml: INSERT INTO harvest VALUES (43);
+step s0_many_subxacts: select subxacts();
+subxacts       
+
+               
+step s0_commit: COMMIT;
+step s2_checkpoint: CHECKPOINT;
+step s2_get_changes_suppress_output: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
'include-xids','0', 'skip-empty-xacts', '1') LIMIT 5;
 
+data           
+
+BEGIN          
+table public.harvest: INSERT: apples[integer]:41
+table public.harvest: INSERT: apples[integer]:42
+table public.harvest: INSERT: apples[integer]:42
+table public.harvest: INSERT: apples[integer]:42
+step s1_commit: COMMIT;
+step s2_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0',
'skip-empty-xacts','1');
 
+data           
+
+BEGIN          
+table public.harvest: INSERT: apples[integer]:43
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/subxact_without_top.spec
b/contrib/test_decoding/specs/subxact_without_top.spec
new file mode 100644
index 0000000000..1e78c2e8bf
--- /dev/null
+++ b/contrib/test_decoding/specs/subxact_without_top.spec
@@ -0,0 +1,65 @@
+# Forces decoding session to see (and associate with toplevel before commit)
+# subxact with previous top records (first xl_xact_assignment) beyond
+# restart_lsn. Such partially seen xact won't be streamed as it must finish
+# before confirmed_flush_lsn, but nevertheless this harmless schedule used to
+# cause
+#   ERROR: subtransaction logged without previous top-level txn record
+# failures.
+
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write
inxact
 
+    CREATE TABLE harvest(apples integer);
+    CREATE OR REPLACE FUNCTION subxacts() returns void as $$
+    BEGIN
+      FOR i in 1 .. 128 LOOP
+        BEGIN
+          INSERT INTO harvest VALUES (42);
+        EXCEPTION
+        WHEN OTHERS THEN
+      RAISE;
+        END;
+    END LOOP;
+    END; $$LANGUAGE 'plpgsql';
+}
+
+teardown
+{
+    DROP TABLE IF EXISTS harvest;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_begin" { BEGIN; }
+step "s0_first_subxact" {
+    DO LANGUAGE plpgsql $$
+      BEGIN
+        BEGIN
+          INSERT INTO harvest VALUES (41);
+    EXCEPTION WHEN OTHERS THEN RAISE;
+    END;
+      END $$;
+}
+step "s0_many_subxacts" { select subxacts(); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_begin" { BEGIN; }
+step "s1_dml" { INSERT INTO harvest VALUES (43); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+step "s2_checkpoint" { CHECKPOINT; }
+step "s2_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids',
'0','skip-empty-xacts', '1'); }
 
+step "s2_get_changes_suppress_output" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
'include-xids','0', 'skip-empty-xacts', '1') LIMIT 5; }
 
+
+# First s2_checkpoint serializes snapshot (creates potential restart_lsn point)
+# *after* initial subxact, i.e. first xl_xact_assignment is beyond it.
+# s0_many_subxacts forces second xl_xact_assignment (issues >
+# PGPROC_MAX_CACHED_SUBXIDS) to associate subxact with toplevel before commit.
+# Second s2_checkpoint ensures s2_get_changes directly following it will
+# advance the slot, establishing that restart_lsn.
+permutation "s0_begin" "s0_first_subxact" "s2_checkpoint" "s1_begin" "s1_dml" "s0_many_subxacts" "s0_commit"
"s2_checkpoint""s2_get_changes_suppress_output" "s1_commit" "s2_get_changes"
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index aeebbf243a..481277a1fd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -838,9 +838,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
     txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
     subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
-    if (new_top && !new_sub)
-        elog(ERROR, "subtransaction logged without previous top-level txn record");
-
     if (!new_sub)
     {
         if (rbtxn_is_known_subxact(subtxn))
-- 
2.11.0



-- cheers, arseny

Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Tue, Feb 11, 2020 at 10:02 AM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> >> Attached is raw version of isolation test triggering false
> >> 'subtransaction logged without...' (case (1)).
> >>
> >
> > This didn't reproduce the desired error for me (tried without a
> > patch).  I think you need to add two more steps ("s2_checkpoint"
> > "s2_get_changes") at the end of the test to set the restart_lsn at the
> > appropriate location.
>
> That's weird, it reliably fails with expected error for me. There are
> already two s2_checkpoint's: first establishes potential (broken)
> restart_lsn (serializes snapshot after first xl_xact_assignment of s0
> xact, but before first record of s1 xact), the second ensures
> s2_get_changes directly following it will actually advance the slot,
>

In my case, s2_get_changes doesn't seem to be advancing the restart
lsn because when it processed running_xact by s2_checkpoint, the slots
confirm flush location (slot->data.confirmed_flush) was behind it.  As
confirmed_flush was behind running_xact of s2_checkpoint, it couldn't
update slot->candidate_restart_lsn (in function
LogicalIncreaseRestartDecodingForSlot).  I think the confirmed_flush
location will only be updated at the end of get_changes.  This is the
reason I need extra get_changes call to generate an error.

I will think and investigate this more, but thought of sharing the
current situation with you.  There is something different going on in
my system or maybe the nature of test is like that.

> making that potential restart_lsn real.
>
> I don't see how adding s2_checkpoint and s2_get_changes helps here. Do
> they really provoke error in your setup?
>

Yes, I am running each of the steps in test manually by using three
different terminals.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

>> That's weird, it reliably fails with expected error for me. There are
>> already two s2_checkpoint's: first establishes potential (broken)
>> restart_lsn (serializes snapshot after first xl_xact_assignment of s0
>> xact, but before first record of s1 xact), the second ensures
>> s2_get_changes directly following it will actually advance the slot,
>>
>
> In my case, s2_get_changes doesn't seem to be advancing the restart
> lsn because when it processed running_xact by s2_checkpoint, the slots
> confirm flush location (slot->data.confirmed_flush) was behind it.  As
> confirmed_flush was behind running_xact of s2_checkpoint, it couldn't
> update slot->candidate_restart_lsn (in function
> LogicalIncreaseRestartDecodingForSlot).  I think the confirmed_flush
> location will only be updated at the end of get_changes.  This is the
> reason I need extra get_changes call to generate an error.
>
> I will think and investigate this more, but thought of sharing the
> current situation with you.  There is something different going on in
> my system or maybe the nature of test is like that.

Ah, I think I know what's happening -- you have one more
xl_running_xacts which catches the advancement -- similar issue is
explained in the comment in oldest_xmin.spec.

Try attached.

From bd9e50d919cc07ec30f06c443d142f5f8510887c Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Wed, 23 Oct 2019 15:56:46 +0300
Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact
 in decoding.

Manifested as

ERROR:  subtransaction logged without previous top-level txn record

this check forbids legit behaviours like
 - First xl_xact_assignment record is beyond reading, i.e. earlier
   restart_lsn.
 - After restart_lsn there is some change of a subxact.
 - After that, there is second xl_xact_assignment (for another subxact)
   revealing relationship between top and first subxact.

Such transaction won't be streamed anyway because we hadn't seen it in full;
confirmed_flush_lsn must be past all these records. Saying for sure whether xact
of some record encountered after snapshot was deserialized can be streamed or
not requires to know whether it wrote something before deserialization point --
if yes, it hasn't been seen in full and can't be decoded. Snapshot doesn't have
such info, so there is no easy way to relax the check.
---
 contrib/test_decoding/Makefile                     |  2 +-
 .../test_decoding/expected/subxact_without_top.out | 39 +++++++++++++
 .../test_decoding/specs/subxact_without_top.spec   | 67 ++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |  3 -
 4 files changed, 107 insertions(+), 4 deletions(-)
 create mode 100644 contrib/test_decoding/expected/subxact_without_top.out
 create mode 100644 contrib/test_decoding/specs/subxact_without_top.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 4afb1d963e..f439c582a5 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
     decoding_into_rel binary prepared replorigin time messages \
     spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-    oldest_xmin snapshot_transfer
+    oldest_xmin snapshot_transfer subxact_without_top
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/subxact_without_top.out
b/contrib/test_decoding/expected/subxact_without_top.out
new file mode 100644
index 0000000000..99ce998822
--- /dev/null
+++ b/contrib/test_decoding/expected/subxact_without_top.out
@@ -0,0 +1,39 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s0_begin s0_first_subxact s2_checkpoint s1_begin s1_dml s0_many_subxacts s0_commit s2_checkpoint
s2_get_changes_suppress_outputs2_get_changes_suppress_output s1_commit s2_get_changes
 
+step s0_begin: BEGIN;
+step s0_first_subxact: 
+    DO LANGUAGE plpgsql $$
+      BEGIN
+        BEGIN
+          INSERT INTO harvest VALUES (41);
+    EXCEPTION WHEN OTHERS THEN RAISE;
+    END;
+      END $$;
+
+step s2_checkpoint: CHECKPOINT;
+step s1_begin: BEGIN;
+step s1_dml: INSERT INTO harvest VALUES (43);
+step s0_many_subxacts: select subxacts();
+subxacts       
+
+               
+step s0_commit: COMMIT;
+step s2_checkpoint: CHECKPOINT;
+step s2_get_changes_suppress_output: SELECT null n FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
'include-xids','0', 'skip-empty-xacts', '1') GROUP BY n;
 
+n              
+
+               
+step s2_get_changes_suppress_output: SELECT null n FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
'include-xids','0', 'skip-empty-xacts', '1') GROUP BY n;
 
+n              
+
+step s1_commit: COMMIT;
+step s2_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0',
'skip-empty-xacts','1');
 
+data           
+
+BEGIN          
+table public.harvest: INSERT: apples[integer]:43
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/subxact_without_top.spec
b/contrib/test_decoding/specs/subxact_without_top.spec
new file mode 100644
index 0000000000..6e9984da62
--- /dev/null
+++ b/contrib/test_decoding/specs/subxact_without_top.spec
@@ -0,0 +1,67 @@
+# Forces decoding session to see (and associate with toplevel before commit)
+# subxact with previous top records (first xl_xact_assignment) beyond
+# restart_lsn. Such partially seen xact won't be streamed as it must finish
+# before confirmed_flush_lsn, but nevertheless this harmless schedule used to
+# cause
+#   ERROR: subtransaction logged without previous top-level txn record
+# failures.
+
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write
inxact
 
+    CREATE TABLE harvest(apples integer);
+    CREATE OR REPLACE FUNCTION subxacts() returns void as $$
+    BEGIN
+      FOR i in 1 .. 128 LOOP
+        BEGIN
+          INSERT INTO harvest VALUES (42);
+        EXCEPTION
+        WHEN OTHERS THEN
+      RAISE;
+        END;
+    END LOOP;
+    END; $$LANGUAGE 'plpgsql';
+}
+
+teardown
+{
+    DROP TABLE IF EXISTS harvest;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_begin" { BEGIN; }
+step "s0_first_subxact" {
+    DO LANGUAGE plpgsql $$
+      BEGIN
+        BEGIN
+          INSERT INTO harvest VALUES (41);
+    EXCEPTION WHEN OTHERS THEN RAISE;
+    END;
+      END $$;
+}
+step "s0_many_subxacts" { select subxacts(); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_begin" { BEGIN; }
+step "s1_dml" { INSERT INTO harvest VALUES (43); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+step "s2_checkpoint" { CHECKPOINT; }
+step "s2_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids',
'0','skip-empty-xacts', '1'); }
 
+step "s2_get_changes_suppress_output" { SELECT null n FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
'include-xids','0', 'skip-empty-xacts', '1') GROUP BY n; }
 
+
+# First s2_checkpoint serializes snapshot (creates potential restart_lsn point)
+# *after* initial subxact, i.e. first xl_xact_assignment is beyond it.
+# s0_many_subxacts forces second xl_xact_assignment (issues >
+# PGPROC_MAX_CACHED_SUBXIDS) to associate subxact with toplevel before commit.
+# Second s2_checkpoint ensures s2_get_changes directly following it will
+# advance the slot, establishing that restart_lsn.
+# We do get_changes twice because if one more xl_running_xacts record had slipped
+# before our CHECKPOINT, confirmed_flush will be advanced only to that record.
+permutation "s0_begin" "s0_first_subxact" "s2_checkpoint" "s1_begin" "s1_dml" "s0_many_subxacts" "s0_commit"
"s2_checkpoint""s2_get_changes_suppress_output" "s2_get_changes_suppress_output" "s1_commit" "s2_get_changes"
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index aeebbf243a..481277a1fd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -838,9 +838,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
     txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
     subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
-    if (new_top && !new_sub)
-        elog(ERROR, "subtransaction logged without previous top-level txn record");
-
     if (!new_sub)
     {
         if (rbtxn_is_known_subxact(subtxn))
-- 
2.11.0



-- cheers, arseny

Re: ERROR: subtransaction logged without previous top-level txn record

From
Dilip Kumar
Date:
On Tue, Feb 11, 2020 at 5:36 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> >> That's weird, it reliably fails with expected error for me. There are
> >> already two s2_checkpoint's: first establishes potential (broken)
> >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0
> >> xact, but before first record of s1 xact), the second ensures
> >> s2_get_changes directly following it will actually advance the slot,
> >>
> >
> > In my case, s2_get_changes doesn't seem to be advancing the restart
> > lsn because when it processed running_xact by s2_checkpoint, the slots
> > confirm flush location (slot->data.confirmed_flush) was behind it.  As
> > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't
> > update slot->candidate_restart_lsn (in function
> > LogicalIncreaseRestartDecodingForSlot).  I think the confirmed_flush
> > location will only be updated at the end of get_changes.  This is the
> > reason I need extra get_changes call to generate an error.
> >
> > I will think and investigate this more, but thought of sharing the
> > current situation with you.  There is something different going on in
> > my system or maybe the nature of test is like that.
>
> Ah, I think I know what's happening -- you have one more
> xl_running_xacts which catches the advancement -- similar issue is
> explained in the comment in oldest_xmin.spec.
>
There is one more inconsistency in the test case which I faced while
trying to reproduce.  The problem is that, after "s0_begin"
"s0_first_subxact", steps the open transaction is the top-transaction
because we have generated the sub-transaction and closed it.  Now,
during the "s0_many_subxacts" step, while scanning the system table
(e.g. for finding the function) the top-transaction might log the WAL
for the hint bits.  And then we will lose the purpose of the test
because we will get the WAL for the top-transaction, after the restart
point and then there will be no error.  For fixing this I have
modified the "s0_first_subxact" as shown below

+step "s0_first_subxact" {
+    DO LANGUAGE plpgsql $$
+      BEGIN
+        BEGIN
+          INSERT INTO harvest VALUES (41);
+ EXCEPTION WHEN OTHERS THEN RAISE;
+ END;
+      END $$;
+}
savepoint s1;   -- added extra
INSERT INTO harvest VALUES (41);  --added extra

Basically, after these two steps, the open transaction will be the
sub-transaction, not the top-transaction and that will make sure that
even if the future steps log some WAL then those will be under the
sub-transaction, not the top-transaction.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Wed, Feb 12, 2020 at 8:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Feb 11, 2020 at 5:36 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> >
> >
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> >
> > >> That's weird, it reliably fails with expected error for me. There are
> > >> already two s2_checkpoint's: first establishes potential (broken)
> > >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0
> > >> xact, but before first record of s1 xact), the second ensures
> > >> s2_get_changes directly following it will actually advance the slot,
> > >>
> > >
> > > In my case, s2_get_changes doesn't seem to be advancing the restart
> > > lsn because when it processed running_xact by s2_checkpoint, the slots
> > > confirm flush location (slot->data.confirmed_flush) was behind it.  As
> > > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't
> > > update slot->candidate_restart_lsn (in function
> > > LogicalIncreaseRestartDecodingForSlot).  I think the confirmed_flush
> > > location will only be updated at the end of get_changes.  This is the
> > > reason I need extra get_changes call to generate an error.
> > >
> > > I will think and investigate this more, but thought of sharing the
> > > current situation with you.  There is something different going on in
> > > my system or maybe the nature of test is like that.
> >
> > Ah, I think I know what's happening -- you have one more
> > xl_running_xacts which catches the advancement -- similar issue is
> > explained in the comment in oldest_xmin.spec.
> >

Right, that is why in my case get_changes were required twice.  After
calling get_changes as we do in oldest_xmin.spec will make test case
reliable.

> There is one more inconsistency in the test case which I faced while
> trying to reproduce.  The problem is that, after "s0_begin"
> "s0_first_subxact", steps the open transaction is the top-transaction
> because we have generated the sub-transaction and closed it.  Now,
> during the "s0_many_subxacts" step, while scanning the system table
> (e.g. for finding the function) the top-transaction might log the WAL
> for the hint bits.
>

I am curious to know how this is happening in your case?  Because we
log WAL for hint-bits only when checksums or wal_log_hints are enabled
(See (or XLogHintBitIsNeeded) which is not the default case?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Dilip Kumar
Date:
On Wed, Feb 12, 2020 at 9:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 12, 2020 at 8:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Feb 11, 2020 at 5:36 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> > >
> > >
> > > Amit Kapila <amit.kapila16@gmail.com> writes:
> > >
> > > >> That's weird, it reliably fails with expected error for me. There are
> > > >> already two s2_checkpoint's: first establishes potential (broken)
> > > >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0
> > > >> xact, but before first record of s1 xact), the second ensures
> > > >> s2_get_changes directly following it will actually advance the slot,
> > > >>
> > > >
> > > > In my case, s2_get_changes doesn't seem to be advancing the restart
> > > > lsn because when it processed running_xact by s2_checkpoint, the slots
> > > > confirm flush location (slot->data.confirmed_flush) was behind it.  As
> > > > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't
> > > > update slot->candidate_restart_lsn (in function
> > > > LogicalIncreaseRestartDecodingForSlot).  I think the confirmed_flush
> > > > location will only be updated at the end of get_changes.  This is the
> > > > reason I need extra get_changes call to generate an error.
> > > >
> > > > I will think and investigate this more, but thought of sharing the
> > > > current situation with you.  There is something different going on in
> > > > my system or maybe the nature of test is like that.
> > >
> > > Ah, I think I know what's happening -- you have one more
> > > xl_running_xacts which catches the advancement -- similar issue is
> > > explained in the comment in oldest_xmin.spec.
> > >
>
> Right, that is why in my case get_changes were required twice.  After
> calling get_changes as we do in oldest_xmin.spec will make test case
> reliable.
>
> > There is one more inconsistency in the test case which I faced while
> > trying to reproduce.  The problem is that, after "s0_begin"
> > "s0_first_subxact", steps the open transaction is the top-transaction
> > because we have generated the sub-transaction and closed it.  Now,
> > during the "s0_many_subxacts" step, while scanning the system table
> > (e.g. for finding the function) the top-transaction might log the WAL
> > for the hint bits.
> >
>
> I am curious to know how this is happening in your case?  Because we
> log WAL for hint-bits only when checksums or wal_log_hints are enabled
> (See (or XLogHintBitIsNeeded) which is not the default case?

Yeah, you are right.  Actually, wal_log_hints is set in my
configuration.  So it should not be a problem.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Wed, Feb 12, 2020 at 9:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 12, 2020 at 8:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Feb 11, 2020 at 5:36 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> > >
> > >
> > > Amit Kapila <amit.kapila16@gmail.com> writes:
> > >
> > > >> That's weird, it reliably fails with expected error for me. There are
> > > >> already two s2_checkpoint's: first establishes potential (broken)
> > > >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0
> > > >> xact, but before first record of s1 xact), the second ensures
> > > >> s2_get_changes directly following it will actually advance the slot,
> > > >>
> > > >
> > > > In my case, s2_get_changes doesn't seem to be advancing the restart
> > > > lsn because when it processed running_xact by s2_checkpoint, the slots
> > > > confirm flush location (slot->data.confirmed_flush) was behind it.  As
> > > > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't
> > > > update slot->candidate_restart_lsn (in function
> > > > LogicalIncreaseRestartDecodingForSlot).  I think the confirmed_flush
> > > > location will only be updated at the end of get_changes.  This is the
> > > > reason I need extra get_changes call to generate an error.
> > > >
> > > > I will think and investigate this more, but thought of sharing the
> > > > current situation with you.  There is something different going on in
> > > > my system or maybe the nature of test is like that.
> > >
> > > Ah, I think I know what's happening -- you have one more
> > > xl_running_xacts which catches the advancement -- similar issue is
> > > explained in the comment in oldest_xmin.spec.
> > >
>
> Right, that is why in my case get_changes were required twice.  After
> calling get_changes as we do in oldest_xmin.spec will make test case
> reliable.
>

Attached is a patch where I have modified the comments and slightly
edited the commit message.  This patch was not getting applied in v11
and branches lower than that, so I prepared a patch for those branches
as well. I have tested this patch till 9.5 and it works as intended.

Can you also once check the patch and verify it in back-branches?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: ERROR: subtransaction logged without previous top-level txn record

From
Dilip Kumar
Date:
On Wed, Feb 12, 2020 at 1:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 12, 2020 at 9:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 12, 2020 at 8:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Feb 11, 2020 at 5:36 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> > > >
> > > >
> > > > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > >
> > > > >> That's weird, it reliably fails with expected error for me. There are
> > > > >> already two s2_checkpoint's: first establishes potential (broken)
> > > > >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0
> > > > >> xact, but before first record of s1 xact), the second ensures
> > > > >> s2_get_changes directly following it will actually advance the slot,
> > > > >>
> > > > >
> > > > > In my case, s2_get_changes doesn't seem to be advancing the restart
> > > > > lsn because when it processed running_xact by s2_checkpoint, the slots
> > > > > confirm flush location (slot->data.confirmed_flush) was behind it.  As
> > > > > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't
> > > > > update slot->candidate_restart_lsn (in function
> > > > > LogicalIncreaseRestartDecodingForSlot).  I think the confirmed_flush
> > > > > location will only be updated at the end of get_changes.  This is the
> > > > > reason I need extra get_changes call to generate an error.
> > > > >
> > > > > I will think and investigate this more, but thought of sharing the
> > > > > current situation with you.  There is something different going on in
> > > > > my system or maybe the nature of test is like that.
> > > >
> > > > Ah, I think I know what's happening -- you have one more
> > > > xl_running_xacts which catches the advancement -- similar issue is
> > > > explained in the comment in oldest_xmin.spec.
> > > >
> >
> > Right, that is why in my case get_changes were required twice.  After
> > calling get_changes as we do in oldest_xmin.spec will make test case
> > reliable.
> >
>
> Attached is a patch where I have modified the comments and slightly
> edited the commit message.  This patch was not getting applied in v11
> and branches lower than that, so I prepared a patch for those branches
> as well. I have tested this patch till 9.5 and it works as intended.
>
> Can you also once check the patch and verify it in back-branches?

I have checked the patch and it looks fine to me.  I have also tested
it on the back branches and it works fine.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Thu, Feb 13, 2020 at 10:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Feb 12, 2020 at 1:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Attached is a patch where I have modified the comments and slightly
> > edited the commit message.  This patch was not getting applied in v11
> > and branches lower than that, so I prepared a patch for those branches
> > as well. I have tested this patch till 9.5 and it works as intended.
> >
> > Can you also once check the patch and verify it in back-branches?
>
> I have checked the patch and it looks fine to me.  I have also tested
> it on the back branches and it works fine.
>

Thanks,  I am planning to commit this next week sometime (on Wednesday
19-Feb) to let others review or share their opinion.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

> Attached is a patch where I have modified the comments and slightly
> edited the commit message.  This patch was not getting applied in v11
> and branches lower than that, so I prepared a patch for those branches
> as well. I have tested this patch till 9.5 and it works as intended.
>
> Can you also once check the patch and verify it in back-branches?

Sorry for the delay. I'm fine with changes. make check down to 9.5 is
also happy here.

Interesting thing about wal_log_hints, hasn't occured to me.


-- cheers, arseny



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Fri, Feb 14, 2020 at 7:04 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> > Attached is a patch where I have modified the comments and slightly
> > edited the commit message.  This patch was not getting applied in v11
> > and branches lower than that, so I prepared a patch for those branches
> > as well. I have tested this patch till 9.5 and it works as intended.
> >
> > Can you also once check the patch and verify it in back-branches?
>
> Sorry for the delay. I'm fine with changes. make check down to 9.5 is
> also happy here.
>

Pushed, let's keep an eye on buildfarm.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Sun, Feb 9, 2020 at 9:37 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> Somehow I hadn't realized this earlier, so my comments/commit messages
> in patches above were not accurate here; I've edited them. Also in the
> first patch serialized snapshots are not no longer used for new slot
> creation at all, as Andres suggested above.
>

+ /*
+ * Don't use serialized snapshot if we are not sure where all
+ * currently running xacts will finish (new slot creation).
+ * (Actually, if we came here through xl_running_xacts, we could perform
+ * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
+ * but added lines of code would hardly worth the benefit.)
+ */
+ if (builder->start_decoding_at == InvalidXLogRecPtr)
+ return false;

Instead of using start_decoding_at to decide whether to restore
snapshot or not, won't it be better to have new variable in SnapBuild
(say can_use_serialized_snap or something like that) and for this
purpose?

I think the patch is trying to use a variable that is not meant for
the purpose we are using for it, so not sure if it is the right
direction for the fix.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

> On Sun, Feb 9, 2020 at 9:37 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> + /*
> + * Don't use serialized snapshot if we are not sure where all
> + * currently running xacts will finish (new slot creation).
> + * (Actually, if we came here through xl_running_xacts, we could perform
> + * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
> + * but added lines of code would hardly worth the benefit.)
> + */
> + if (builder->start_decoding_at == InvalidXLogRecPtr)
> + return false;
>
> Instead of using start_decoding_at to decide whether to restore
> snapshot or not, won't it be better to have new variable in SnapBuild
> (say can_use_serialized_snap or something like that) and for this
> purpose?

start_decoding_at who is initialized externally at
AllocateSnapshotBuilder is what actually defines how to handle
serialized snapshots: if it is valid LSN, snapbuild must trust the
caller that WAL reading starts early enough to stream since this LSN, so
we deserialize the snap and jump into CONSISTENT. If it is invalid, we
don't know the safe streaming point yet, and it remains invalid until we
learn full snapshot and then wait for all xacts finishing. So such bool
would be a pointless synonym.

Moreover, as cited comment mentions:

> + * (Actually, if we came here through xl_running_xacts, we could perform
> + * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
> + * but added lines of code would hardly worth the benefit.)

there is nothing wrong in using the serialized snapshot per se. It's
just that we must wait for all xacts finishing after getting the
snapshot and this is impossible if we don't know who is running. So
can_use_serialized_snap would be even somewhat confusing.


-- cheers, arseny



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Mon, Mar 2, 2020 at 1:11 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> > On Sun, Feb 9, 2020 at 9:37 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> > + /*
> > + * Don't use serialized snapshot if we are not sure where all
> > + * currently running xacts will finish (new slot creation).
> > + * (Actually, if we came here through xl_running_xacts, we could perform
> > + * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
> > + * but added lines of code would hardly worth the benefit.)
> > + */
> > + if (builder->start_decoding_at == InvalidXLogRecPtr)
> > + return false;
> >
> > Instead of using start_decoding_at to decide whether to restore
> > snapshot or not, won't it be better to have new variable in SnapBuild
> > (say can_use_serialized_snap or something like that) and for this
> > purpose?
>
> start_decoding_at who is initialized externally at
> AllocateSnapshotBuilder is what actually defines how to handle
> serialized snapshots: if it is valid LSN, snapbuild must trust the
> caller that WAL reading starts early enough to stream since this LSN, so
> we deserialize the snap and jump into CONSISTENT. If it is invalid, we
> don't know the safe streaming point yet, and it remains invalid until we
> learn full snapshot and then wait for all xacts finishing.
>

I think here you are trying to deduce the meaning.  I don't see that
it can clearly define that don't use serialized snapshots. It is not
clear to me why have you changed the below code, basically why it is
okay to pass InvalidXLogRecPtr instead of restart_lsn?

@@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
  ReplicationSlotMarkDirty();
  ReplicationSlotSave();

- ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
+ ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
  need_full_snapshot, false,
  read_page, prepare_write, do_write,
  update_progress);

> So such bool
> would be a pointless synonym.
>
> Moreover, as cited comment mentions:
>
> > + * (Actually, if we came here through xl_running_xacts, we could perform
> > + * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
> > + * but added lines of code would hardly worth the benefit.)
>
> there is nothing wrong in using the serialized snapshot per se. It's
> just that we must wait for all xacts finishing after getting the
> snapshot and this is impossible if we don't know who is running.
>

I am not denying any such possibility and even if we do something like
that it will be for the master branch.

> So
> can_use_serialized_snap would be even somewhat confusing.
>

It is possible that your idea of trying to deduce from
start_decoding_at is better, but I am not sure about it if anyone else
also thinks that is a good way to do, then fine.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

> I think here you are trying to deduce the meaning.  I don't see that
> it can clearly define that don't use serialized snapshots. It is not
> clear to me why have you changed the below code, basically why it is
> okay to pass InvalidXLogRecPtr instead of restart_lsn?
>
> @@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
>   ReplicationSlotMarkDirty();
>   ReplicationSlotSave();
>
> - ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
> + ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
>   need_full_snapshot, false,
>   read_page, prepare_write, do_write,
>   update_progress);

Because when we create the slot we don't demand to stream from some
specific point. In fact we just can't, because we don't know since which
LSN it is actually possible to stream, i.e. when we'd have good snapshot
and no old (which we haven't seen in full) xacts running. It is up to
snapbuild.c to define this point. The previous coding was meaningless:
we asked for some random restart_lsn and snapbuild.c would silently
advance it to earliest suitable LSN.

OTOH, when we are decoding from existing slot not only we know earliest
possible point, but to avoid missing xacts we must enforce streaming
since this very point despite the snapbuilder being unable (because he
might not know which xacts are running at point of the snapshot) to
check its safety.

start_decoding_at reflects the difference between these scenarios, and
serialized snapshots handling stems from here.

Thanks for looking into this.


-- cheers, arseny



Re: ERROR: subtransaction logged without previous top-level txn record

From
Amit Kapila
Date:
On Mon, Mar 2, 2020 at 7:07 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> > I think here you are trying to deduce the meaning.  I don't see that
> > it can clearly define that don't use serialized snapshots. It is not
> > clear to me why have you changed the below code, basically why it is
> > okay to pass InvalidXLogRecPtr instead of restart_lsn?
> >
> > @@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
> >   ReplicationSlotMarkDirty();
> >   ReplicationSlotSave();
> >
> > - ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
> > + ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
> >   need_full_snapshot, false,
> >   read_page, prepare_write, do_write,
> >   update_progress);
>
> Because when we create the slot we don't demand to stream from some
> specific point. In fact we just can't, because we don't know since which
> LSN it is actually possible to stream, i.e. when we'd have good snapshot
> and no old (which we haven't seen in full) xacts running. It is up to
> snapbuild.c to define this point. The previous coding was meaningless:
> we asked for some random restart_lsn and snapbuild.c would silently
> advance it to earliest suitable LSN.
>

Hmm, if this is the case then it should be true even without solving
this particular problem and we should be able to make this change.
Leaving that aside, I think this change can make copy replication slot
functionality to also skip using serialized snapshots with this patch
which is not our intention.  Also, it doesn't seem like a good idea to
ignore setting start_decoding_at when we already set
slot->data.restart_lsn with this value.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ERROR: subtransaction logged without previous top-level txn record

From
Arseny Sher
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

>> Because when we create the slot we don't demand to stream from some
>> specific point. In fact we just can't, because we don't know since which
>> LSN it is actually possible to stream, i.e. when we'd have good snapshot
>> and no old (which we haven't seen in full) xacts running. It is up to
>> snapbuild.c to define this point. The previous coding was meaningless:
>> we asked for some random restart_lsn and snapbuild.c would silently
>> advance it to earliest suitable LSN.
>>
>
> Hmm, if this is the case then it should be true even without solving
> this particular problem and we should be able to make this change.

Right.

> Leaving that aside, I think this change can make copy replication slot
> functionality to also skip using serialized snapshots with this patch
> which is not our intention.

As I say at [1] logical slot copying facility is currently anyway broken
in this regard: restart_lsn is copied, but confirmed_flush isn't, and
the right fix, in my view, is to avoid DecodingContextFindStartpoint
there altogether (by checking donor's confirmed_flush is valid and
copying it) which would render this irrelevant. To speculate, even if
wanted to go through DecodingContextFindStartpoint for slot copying and
establish confirmed_flush on our own, surely we'd need to handle
serialized snapshots exactly as new slot creation does because dangers
of getting SNAPBUILD_CONSISTENT too early are the same in both cases.

> Also, it doesn't seem like a good idea to ignore setting
> start_decoding_at when we already set slot->data.restart_lsn with this
> value.

Well, these two fields have absolutely different values. BTW I find the
naming here somewhat unfortunate, and this phrase suggests that it
indeed leads to confusion.

Slot's restart_lsn is the LSN since which we start reading WAL and by
setting data.restart_lsn we prevent WAL we need from
recycling. start_decoding_at is the LSN since which we start
*streaming*, i.e. actually replaying COMMITs. So setting the first one
(as we must hold WAL) and not the second one (as we don't know the
streaming point yet when we start slot creation) is just fine.


[1]
https://www.postgresql.org/message-id/flat/CA%2Bfd4k70BXLTm-N6q18LrL%3DGbKtwY3-2%2B%2BUVFw05SvFTkZgTyQ%40mail.gmail.com


-- cheers, arseny