Thread: ERROR: subtransaction logged without previous top-level txn record
Hello,

I saw the previous thread but it wasn't in my inbox at the time, so I'm creating a new one, sorry about that.
https://www.postgresql.org/message-id/20190516170434.masck6ehwg2kvbi2@alap3.anarazel.de

I've managed to reproduce the issue pretty consistently on REL9_6_STABLE on commit 959792087a10baf7f1b58408d28411109bcedb7a.

OS version:

[ec2-user@ ... ~]$ uname -a
... 4.14.77-80.57.amzn2.x86_64 #1 SMP Tue Nov 6 21:18:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Postgres version:

postgres=# SELECT version();
                                                  version
----------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.13 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5), 64-bit
(1 row)

I'm on an EC2 m5.4xlarge if that matters.

Repro steps:

1. Create the following script:

[ec2-user@ip-172-31-18-48 ~]$ cat ~/subbench.txt
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
\set subcnt random(1, 800)
select * from pgbench(:aid, :bid, :tid, :delta, :subcnt);

2. Create the following function:

CREATE OR REPLACE FUNCTION pgbench(pAid int, pBid int, pTid int, delta int, subcnt int)
returns int as $$
DECLARE
    abal int;
BEGIN
    FOR i in 1 .. subcnt LOOP
        BEGIN
            UPDATE pgbench_accounts SET abalance = abalance + delta WHERE aid = pAid;
            --subcnt := subcnt;
        EXCEPTION WHEN division_by_zero THEN
            subcnt := subcnt;
        END;
    END LOOP;
    abal := abalance FROM pgbench_accounts WHERE aid = pAid;
    return abal;
END;
$$ LANGUAGE 'plpgsql';

3. Create a few logical slots in the database:

select pg_create_logical_replication_slot('test_slot_1', 'test_decoding');
select pg_create_logical_replication_slot('test_slot_2', 'test_decoding');
select pg_create_logical_replication_slot('test_slot_3', 'test_decoding');
...

4. Initialize pgbench:

pgbench -i -d postgres

5. Load the data:

pgbench -f subbench.txt -c 64 -j 64 -T 600 -P 1 -d postgres

6. Run pg_recvlogical with a timeout; it usually takes a few iterations (~7-9) before the error occurs:

var=0
while true; do
    timeout 30 pg_recvlogical -d postgres --start --slot test_slot_1 -f /dev/null
    var=$((var+1))
    echo "Sleeping 5s Time: $var"
    sleep 5
done

pg_recvlogical -d postgres --start --slot test_slot_1 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR: subtransaction logged without previous top-level txn record

pg_recvlogical -d postgres --start --slot test_slot_2 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR: subtransaction logged without previous top-level txn record

pg_recvlogical -d postgres --start --slot test_slot_3 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR: subtransaction logged without previous top-level txn record

pg_recvlogical: disconnected; waiting 5 seconds to try again

What's interesting is that the confirmed_flush_lsn are all different from test_slot_1 --> test_slot_3:

postgres=# select * from pg_replication_slots;
  slot_name  |    plugin     | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-------------+---------------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
 test_slot_1 | test_decoding | logical   |  13382 | postgres | f      |            |      |         1848 | 0/1C5BC5A0  | 0/5488E468
 test_slot_2 | test_decoding | logical   |  13382 | postgres | f      |            |      |         1848 | 0/1C5BC5A0  | 0/40E45EA0
 test_slot_3 | test_decoding | logical   |  13382 | postgres | f      |            |      |         1848 | 0/3F4B6AF8  | 0/6BB3A990

Let me know if you require more info to repro. Thanks!

John H
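For what it's worth, the same decoder ERROR can also be reached from plain SQL, without pg_recvlogical; a rough equivalent of step 6, assuming the slots created above (the count(*) wrapper is only there to avoid printing every change):

-- Consuming or peeking a slot runs the same decoding machinery, so once the
-- slot's position reaches the problematic stretch of WAL these calls should
-- fail with the same "subtransaction logged without previous top-level txn
-- record" error.
SELECT count(*) FROM pg_logical_slot_peek_changes('test_slot_1', NULL, NULL);
SELECT count(*) FROM pg_logical_slot_get_changes('test_slot_1', NULL, NULL);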
Hi,

Our customer also encountered this issue and I've looked into it. The problem reproduces well enough using the instructions in the previous message.

The check leading to this ERROR is too strict; it forbids legit behaviours. Say we have in WAL

[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]

- The first xl_xact_assignment record is beyond reading, i.e. earlier than restart_lsn, where a ready snapshot will be taken from disk.
- After restart_lsn there is some change of a subxact.
- After that, there is a second xl_xact_assignment (for another subxact) revealing the relationship between the top-level xact and the first subxact; this is where the ERROR fires.

Such a transaction won't be streamed because we hadn't seen it in full. It must finish before streaming starts, i.e. before confirmed_flush_lsn.

Of course, the easiest fix is to just throw the check out. However, supposing that someone would probably want to relax it instead, I considered ways to accomplish this. Something like "if we are still in FULL_SNAPSHOT and the xid is before SnapBuildNextPhaseAt, just ignore the xl_xact_assignment record: we haven't seen such an xact in full and definitely won't stream it."

That led to the discovery of another bug in a place I had long found suspicious. The snapbuilder enters SNAPBUILD_CONSISTENT immediately after deserializing the snapshot. Generally this is incorrect, because SNAPBUILD_CONSISTENT means not just a complete snapshot (the snapshot by itself in FULL state is just as good as in CONSISTENT), but also a reorderbuffer filled with all currently running xacts. This is painless for decoding sessions with existing slots, because they won't stream anything before confirmed_flush_lsn is reached anyway, at which point all transactions which hadn't got into the reorderbuffer have definitely finished. However, new slots might be created too early, thus losing (not decoding) parts of transactions committed after the freshly created confirmed_flush_lsn. This can happen under the following extremely unlucky circumstances:

- New slot creation reserves the point in WAL from which it will read (GetXLogInsertRecPtr);
- It logs xl_running_xacts to start assembling a snapshot;
- A running decoding session with another slot quickly reads this xl_running_xacts and serializes its snapshot;
- The new slot reads xl_running_xacts and picks this snapshot up, saying that it is ready to stream henceforth, though its reorderbuffer is empty.

Exact reproducing steps:

-- session 1
create table t (i int);
select pg_create_logical_replication_slot('slot_1', 'test_decoding');

-- session 2
begin;
insert into t values (1);

-- session 3, start slot creation
select pg_create_logical_replication_slot('slot_2', 'test_decoding');
-- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);

-- session 1
-- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
-- need to sleep; our snap will be immediately serialized there
SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');

-- continue slot_2 creation

-- session 2: insert some more and commit
insert into t values (1);
commit;

-- now this would find the second insert, but not the first one
SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');

What can we do here? Initially my reaction was: fine, then let's get into FULL_SNAPSHOT upon deserializing the snap and wait for all xacts to finish as usual. However, to my surprise I've found this impossible. That is, the snapbuilder has no way to enforce that we go into CONSISTENT only once we have seen all running xacts completely, without risking skipping legit transactions. Specifically, after deserializing a FULL snapshot the snapbuilder must iterate further over WAL until all running xacts finish, as we must see, with correct snapshots, all changes of every transaction we are going to stream. However, the snapbuilder can't *immediately* notice this point, because

- The snapbuilder updates xmin (first running xact) by taking it from xl_running_xacts (c.f. SnapBuildProcessRunningXacts). Even if we guarantee that, for each possible WAL reading starting position, there is always an xl_running_xacts record logged right before the earliest possible streaming point -- IOW, after all xacts which we can't stream have finished (which is currently true, btw, as slot advancement is considered only at xl_running_xacts) -- that would not be enough due to races around xl_running_xacts. I.e. with WAL like

  [ <T1> <restart_lsn> <T1 commit> <confirmed_flush_lsn, xrx> <T2 commit> ]

  T2 might be skipped if T1 is shown as running in xl_running_xacts.
- Tracking xmin manually by recording commits is not only inefficient, it is just not feasible, because the serialized snapshot is not full: it contains only committed catalog-modifying xacts. Thus, we can't distinguish a non-catalog-modifying xact committed before the serialized snapshot from one not yet committed.

This means only code external to the snapbuilder knows the earliest point suitable for streaming; the slot advancement machinery ensures that the <restart_lsn, confirmed_flush_lsn> pair is always good. So a possible fix is the following: if the snapbuilder's user knows the exact LSN since which streaming is safe (an existing slot, essentially), trust it and switch into CONSISTENT state after deserializing the snapshot, as before. OTOH, if it doesn't know it (new slot creation), go via the usual FULL -> CONSISTENT procedure; we might transition into CONSISTENT a bit later than it became possible, but there is nothing bad about that. The first attached patch implements this. I don't particularly like it, but the only alternative I see is to rework the slot advancement logic to make the <restart_lsn, confirmed_flush_lsn> pair such that there is always an xl_running_xacts before confirmed_flush_lsn which confirms that all xacts running as of restart_lsn have finished. That unnecessary complexity looks much worse.

As for the check in the topic, I nonetheless propose to remove it completely, as in the second attached patch. Saying for sure whether the xact of some record encountered after the snapshot was deserialized can be streamed or not requires knowing nextXid (the first not yet running xid) as of this snapshot's LSN -- all xids < nextXid possibly hadn't been seen in full and are not subject to decoding. However, generally we don't know nextXid, which is taken from xl_running_xacts; in particular, the snapshot can be serialized/deserialized at XLOG_END_OF_RECOVERY. Changing that for the sake of the check in question is not worthwhile, so just throw it out instead of trying to relax it. In fact, I don't see what is so important about seeing the top-level xact first at all. During COMMIT decoding we'll know all subxids anyway, so why care?

P.S. While digging into this, I have noted that the return values of SnapBuildFindSnapshot seem pretty random to me.
Basically, returning 'true' immediately performs 4 things:
- update xmin
- purge old xip entries
- advance xmin of the slot
- if CONSISTENT, advance lsn (earliest serialized snap)

The latter two make sense only after the slot is created or confirmed_flush_lsn is reached. The first two make sense even immediately after deserializing the snapshot (because it is serialized *before* updating xmin and xip); generally, always when committed xids are tracked. Then why is cleanup done when the xmin horizon is too low? Why is it not performed after restoring a serialized snapshot? On the whole, I find this not very important, as all these operations are pretty cheap and harmless if executed too early -- it would be simpler to just do them always.

-- Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From 86a03671c3bdbaee3dabe02a9a1f8e2e23098652 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Tue, 22 Oct 2019 19:02:14 +0300
Subject: [PATCH 1/2] Fix serialized snapshot usage for new logical slots.

Previously, the snapbuilder entered SNAPBUILD_CONSISTENT immediately after
deserializing the snapshot. Generally this is incorrect because
SNAPBUILD_CONSISTENT means not just a complete snapshot, but also a
reorderbuffer filled with all currently running xacts. This is painless for
decoding sessions with existing slots because they won't stream anything
before confirmed_flush_lsn is reached anyway, at which point all transactions
which hadn't got into the reorderbuffer have definitely finished. However,
new slots might be created too early, thus losing (not decoding) parts of
transactions committed after the freshly created confirmed_flush_lsn. This
can happen under the following extremely unlucky circumstances:
- New slot creation reserves the point in WAL from which it will read
  (GetXLogInsertRecPtr);
- It logs xl_running_xacts to start assembling a snapshot;
- A running decoding session with another slot quickly reads this
  xl_running_xacts and serializes its snapshot;
- The new slot reads xl_running_xacts and picks this snapshot up, saying that
  it is ready to stream henceforth, though its reorderbuffer is empty.

It turns out, as the comment to AllocateSnapshotBuilder explains, that the
snapbuilder can't locate the *earliest* possible point for streaming on its
own. Thus, if the snapbuild user knows it (a decoding session for an existing
slot), we trust it and switch after snapshot deserialization into
SNAPBUILD_CONSISTENT as previously. However, if it is not known (new slot
creation), switch into SNAPBUILD_FULL_SNAPSHOT and wait for all running xacts
to finish as usual.
--- src/backend/replication/logical/logical.c | 2 +- src/backend/replication/logical/snapbuild.c | 141 ++++++++++++++++++++++------ 2 files changed, 115 insertions(+), 28 deletions(-) diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c index da265f5294..67ed52e56f 100644 --- a/src/backend/replication/logical/logical.c +++ b/src/backend/replication/logical/logical.c @@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin, ReplicationSlotMarkDirty(); ReplicationSlotSave(); - ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon, + ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon, need_full_snapshot, false, read_page, prepare_write, do_write, update_progress); diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c index 0bd1d0f954..3ed178a4f8 100644 --- a/src/backend/replication/logical/snapbuild.c +++ b/src/backend/replication/logical/snapbuild.c @@ -165,7 +165,7 @@ struct SnapBuild /* * Don't replay commits from an LSN < this LSN. This can be set externally - * but it will also be advanced (never retreat) from within snapbuild.c. + * or established by snapbuild.c once consistent snapshot is assembled. */ XLogRecPtr start_decoding_at; @@ -275,7 +275,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof /* serialization functions */ static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn); -static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn); +static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn, TransactionId nextXid); /* * Return TransactionId after which the next phase of initial snapshot @@ -309,7 +309,42 @@ SnapBuildStartNextPhaseAt(SnapBuild *builder, TransactionId at) * Allocate a new snapshot builder. * * xmin_horizon is the xid >= which we can be sure no catalog rows have been - * removed, start_lsn is the LSN >= we want to replay commits. + * removed. + * start_lsn is InvalidXLogRecPtr or LSN >= we want to replay commits. + * InvalidXLogRecPtr is used during slot creation; snapbuild will assemble + * consistent snapshot and (if continue decoding -- no core code does that + * currently) stream all further commits. + * If normal lsn is passed, caller *must* be sure that WAL reading starts + * early enough to build the snapshot and pick up the first change of + * earliest xact to stream. Slot creation and advancement machinery + * guarantees that slot's <restart_lsn, confirmed_flush_lsn> pair always + * satisfies this. + * + * The condition is the caller's responsibility because in some cases + * snapbuilder has no way to enforce this rule without risk of skipping legit + * transactions. Specifically, after constructing (usually deserializing) FULL + * snapshot snapbuilder must iterate over WAL further until all running xacts + * from the snap finish, as we must see with correct snapshots all changes of + * every transaction we are going to stream. However, snapbuild can't + * *immediately* notice this point, because + * - Snapbuilder updates xmin by taking it from xl_running_xacts + * (c.f. SnapBuildProcessRunningXacts). 
Even if we guarantee that, for + * each possible WAL reading starting position, there is always an an + * xl_running_xacts records logged right before the earliest possible + * streaming point -- IOW, after all xacts which we can't stream had + * finished (which is currently true btw, as slot's advancement is + * considered only at xl_running_xacts) -- that would not be enough due + * to races around xl_running_xacts, i.e with WAL like + * [ <T1> <restart_lsn> <T1 commit> <confirmed_flush_lsn, xrx> <T2 commit> ] + * T2 might be skipped if T1 is shown as running in xl_running_xacts. + * - Tracking xmin manually by recoding commits is not only inefficient, + * it just not feasible because serialized snapshot is not full: it + * contains only committed catalog-modifying xacts. Thus, we can't + * distinguish non-catalog-modifying xact committed before serialized + * snapshot from not yet committed one. + * So, trust the caller: once given start_lsn is reached, it means we must + * have reached the point where all further xacts can be streamed, FULL -> + * CONSISTENT transition. */ SnapBuild * AllocateSnapshotBuilder(ReorderBuffer *reorder, @@ -408,7 +443,8 @@ SnapBuildCurrentState(SnapBuild *builder) bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr) { - return ptr < builder->start_decoding_at; + return XLogRecPtrIsInvalid(builder->start_decoding_at) || + ptr < builder->start_decoding_at; } /* @@ -945,16 +981,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder)))) { /* ensure that only commits after this are getting replayed */ - if (builder->start_decoding_at <= lsn) - builder->start_decoding_at = lsn + 1; + Assert(XLogRecPtrIsInvalid(builder->start_decoding_at) || + builder->start_decoding_at > lsn); return; } if (builder->state < SNAPBUILD_CONSISTENT) { /* ensure that only commits after this are getting replayed */ - if (builder->start_decoding_at <= lsn) - builder->start_decoding_at = lsn + 1; + Assert(XLogRecPtrIsInvalid(builder->start_decoding_at) || + builder->start_decoding_at > lsn); /* * If building an exportable snapshot, force xid to be tracked, even @@ -966,6 +1002,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, } } + if (!XLogRecPtrIsInvalid(builder->start_decoding_at) && + builder->start_decoding_at <= lsn) + { + /* + * We are going to stream this xact, so must already have fine + * snapshot. + */ + Assert(builder->state == SNAPBUILD_CONSISTENT); + } + for (nxact = 0; nxact < nsubxacts; nxact++) { TransactionId subxid = subxacts[nxact]; @@ -1250,10 +1296,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn */ if (running->oldestRunningXid == running->nextXid) { - if (builder->start_decoding_at == InvalidXLogRecPtr || - builder->start_decoding_at <= lsn) + if (XLogRecPtrIsInvalid(builder->start_decoding_at)) /* can decode everything after this */ builder->start_decoding_at = lsn + 1; + Assert(builder->start_decoding_at >= lsn); /* As no transactions were running xmin/xmax can be trivially set. 
*/ builder->xmin = running->nextXid; /* < are finished */ @@ -1275,9 +1321,9 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn } /* b) valid on disk state and not building full snapshot */ else if (!builder->building_full_snapshot && - SnapBuildRestore(builder, lsn)) + SnapBuildRestore(builder, lsn, running->nextXid)) { - /* there won't be any state to cleanup */ + /* there won't be much state to cleanup */ return false; } @@ -1358,6 +1404,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn { builder->state = SNAPBUILD_CONSISTENT; SnapBuildStartNextPhaseAt(builder, InvalidTransactionId); + if (XLogRecPtrIsInvalid(builder->start_decoding_at)) + /* can decode everything after this */ + builder->start_decoding_at = lsn + 1; + Assert(builder->start_decoding_at >= lsn); ereport(LOG, (errmsg("logical decoding found consistent point at %X/%X", @@ -1471,8 +1521,8 @@ typedef struct SnapBuildOnDisk void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn) { - if (builder->state < SNAPBUILD_CONSISTENT) - SnapBuildRestore(builder, lsn); + if (builder->state < SNAPBUILD_FULL_SNAPSHOT) + SnapBuildRestore(builder, lsn, InvalidTransactionId); else SnapBuildSerialize(builder, lsn); } @@ -1499,10 +1549,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn) builder->last_serialized_snapshot <= lsn); /* - * no point in serializing if we cannot continue to work immediately after - * restoring the snapshot + * No point in serializing if the snapshot is not complete. + * However, FULL snapshot is just as good as CONSISTENT; difference + * between these states is not snapshot property, but whether we have + * filled reorderbuffer with all currently running xacts. */ - if (builder->state < SNAPBUILD_CONSISTENT) + if (builder->state < SNAPBUILD_FULL_SNAPSHOT) return; /* @@ -1688,10 +1740,15 @@ out: /* * Restore a snapshot into 'builder' if previously one has been stored at the - * location indicated by 'lsn'. Returns true if successful, false otherwise. + * location indicated by 'lsn'. + * nextXid is first not yet running xid as of this lsn or InvalidTransactionId; + * unless external code knows where it is safe to start streaming, we can + * use serialized snapshot only if we are aware which xids finish we must + * wait to be able to stream all further commits. + * Returns true if successful, false otherwise. */ static bool -SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn) +SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn, TransactionId nextXid) { SnapBuildOnDisk ondisk; int fd; @@ -1701,7 +1758,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn) pg_crc32c checksum; /* no point in loading a snapshot if we're already there */ - if (builder->state == SNAPBUILD_CONSISTENT) + if (builder->state >= SNAPBUILD_FULL_SNAPSHOT) return false; sprintf(path, "pg_logical/snapshots/%X-%X.snap", @@ -1884,11 +1941,47 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn) if (TransactionIdPrecedes(ondisk.builder.xmin, builder->initial_xmin_horizon)) goto snapshot_not_interesting; + /* + * Don't use snapshot if external code doesn't know where it is safe to + * start streaming and we have no idea which xids to wait for. 
+ */ + if (XLogRecPtrIsInvalid(builder->start_decoding_at) && + !TransactionIdIsValid(nextXid)) + goto snapshot_not_interesting; /* ok, we think the snapshot is sensible, copy over everything important */ builder->xmin = ondisk.builder.xmin; builder->xmax = ondisk.builder.xmax; - builder->state = ondisk.builder.state; + Assert(ondisk.builder.state >= SNAPBUILD_FULL_SNAPSHOT); + + if (XLogRecPtrIsInvalid(builder->start_decoding_at)) + { + /* + * Snapshot is fine, now we need to wait till we see all further + * commits since the xact's first record. + */ + builder->state = SNAPBUILD_FULL_SNAPSHOT; + SnapBuildStartNextPhaseAt(builder, nextXid); + } + else + { + /* + * If external code (c.f. AllocateSnapshotBuilder) knows we would pick + * up all xacts in full before start_decoding_at, just go directly + * into CONSISTENT. Though we probably can't stream right now (as we + * haven't seen beginnings of some xacts), no xact will be streamed + * before start_decoding_at, and we can't be sure to switch into + * CONSISTENT later in time anyway. + */ + builder->state = SNAPBUILD_CONSISTENT; + SnapBuildStartNextPhaseAt(builder, InvalidTransactionId); + + ereport(LOG, + (errmsg("logical decoding found consistent point at %X/%X", + (uint32) (lsn >> 32), (uint32) lsn), + errdetail("Logical decoding will begin using saved snapshot."))); + } + builder->committed.xcnt = ondisk.builder.committed.xcnt; /* We only allocated/stored xcnt, not xcnt_space xids ! */ @@ -1911,12 +2004,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn) ReorderBufferSetRestartPoint(builder->reorder, lsn); - Assert(builder->state == SNAPBUILD_CONSISTENT); - - ereport(LOG, - (errmsg("logical decoding found consistent point at %X/%X", - (uint32) (lsn >> 32), (uint32) lsn), - errdetail("Logical decoding will begin using saved snapshot."))); return true; snapshot_not_interesting: -- 2.11.0 From ab798a087d99ea8d7d0df4c296fee5787cf5394e Mon Sep 17 00:00:00 2001 From: Arseny Sher <sher-ars@yandex.ru> Date: Wed, 23 Oct 2019 15:56:46 +0300 Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact in decoding. Manifested as ERROR: subtransaction logged without previous top-level txn record , this check forbids legit behaviours like - First xl_xact_assignment record is beyond reading, i.e. earlier restart_lsn. - After restart_lsn there is some change of a subxact. - After that, there is second xl_xact_assignment (for another subxact) revealing relationship between top and first subxact. Such transaction won't be streamed anyway because we hadn't seen it in full; confirmed_flush_lsn must be past all these records. Saying for sure whether xact of some record encountered after snapshot was deserialized can be streamed or not requires to know nextXid (first not yet running xid) as of this snapshot's lsn -- all xids < nextXid possibly hadn't been seen in full and are not subject to decoding. However, generally we don't know nextXid which is taken from xl_running_xacts; in particular snapshot can be serizalized/deserialized at XLOG_END_OF_RECOVERY. Changing that for the sake of the check in question is not worthwhile, so just throw it out instead of trying to relax. 
---
 src/backend/replication/logical/reorderbuffer.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8ce28ad629..6faba6077e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
 	subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
-	if (new_top && !new_sub)
-		elog(ERROR, "subtransaction logged without previous top-level txn record");
-
 	if (!new_sub)
 	{
 		if (subtxn->is_known_as_subxact)
-- 
2.11.0
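For context on why the original pgbench reproducer leans on subtransactions: on the branches discussed here, a transaction only starts emitting XLOG_XACT_ASSIGNMENT records once it has accumulated more than PGPROC_MAX_CACHED_SUBXIDS (64) subxacts, which is exactly what the plpgsql EXCEPTION blocks in the pgbench() function provide. A minimal sketch for generating such records by hand, reusing the table t from the reproducing steps above (the function name is made up for illustration):

-- Every BEGIN ... EXCEPTION block opens a subtransaction; looping well past 64
-- of them overflows the parent's cached subxid array, so assignment records
-- are written to WAL and the decoder has to deal with them.
CREATE OR REPLACE FUNCTION many_subxacts(n int) RETURNS void AS $$
BEGIN
    FOR i IN 1..n LOOP
        BEGIN
            INSERT INTO t VALUES (i);
        EXCEPTION WHEN division_by_zero THEN
            NULL;  -- never reached; the block exists only to open a subxact
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT many_subxacts(100);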
Hi, On 2019-10-24 12:59:30 +0300, Arseny Sher wrote: > Our customer also encountered this issue and I've looked into it. The problem is > reproduced well enough using the instructions in the previous message. Is this with commit bac2fae05c7737530a6fe8276cd27d210d25c6ac Author: Alvaro Herrera <alvherre@alvh.no-ip.org> Date: 2019-09-13 16:36:28 -0300 logical decoding: process ASSIGNMENT during snapshot build Most WAL records are ignored in early SnapBuild snapshot build phases. But it's critical to process some of them, so that later messages have the correct transaction state after the snapshot is completely built; in particular, XLOG_XACT_ASSIGNMENT messages are critical in order for sub-transactions to be correctly assigned to their parent transactions, or at least one assert misbehaves, as reported by Ildar Musin. Diagnosed-by: Masahiko Sawada Author: Masahiko Sawada Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com applied? > The check leading to this ERROR is too strict, it forbids legit behaviours. Say > we have in WAL > > [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ] > > - First xl_xact_assignment record is beyond reading, i.e. earlier > restart_lsn, where ready snapshot will be taken from disk. > - After restart_lsn there is some change of a subxact. > - After that, there is second xl_xact_assignment (for another subxact) > revealing relationship between top and first subxact, where this ERROR fires. > > Such transaction won't be streamed because we hadn't seen it in full. It must be > finished before streaming will start, i.e. before confirmed_flush_lsn. > > Of course, the easiest fix is to just throw the check out. I don't think that'd actually be a fix, and just hiding a bug. > Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the > snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not > just complete snapshot (snapshot by itself in FULL state is just good as in > CONSISTENT), but also reorderbuffer filled with all currently running > xacts. This is painless for decoding sessions with existing slots because they > won't stream anything before confirmed_flush_lsn is reached anyway, at which > point all transactions which hadn't got into reorderbuffer would definitely > finish. However, new slots might be created too early, thus losing (not > decoding) parts of transactions committed after freshly created > confirmed_flush_lsn. This can happen under the following extremely unlucky > circumstances: > - New slot creation reserves point in WAL since which it would read it > (GetXLogInsertRecPtr); > - It logs xl_running_xacts to start assembling a snapshot; > - Running decoding session with another slot quickly reads this > xl_running_xacts and serializes its snapshot; > - New slot reads xl_running_xacts and picks this snapshot up, saying that it > is ready to stream henceforth, though its reorderbuffer is empty. 
Yea, that's a problem :( > Exact reproducing steps: > > -- session 1 > create table t (i int); > select pg_create_logical_replication_slot('slot_1', 'test_decoding'); > > -- session 2 > begin; > insert into t values (1); > > -- session 3, start slot creation > select pg_create_logical_replication_slot('slot_2', 'test_decoding'); > -- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx); > > -- session 1 > -- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no > -- need to sleep; our snap will be immediately serialized there > SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0'); > > -- continue slot_2 creation > > -- session 2: insert some more and commit > insert into t values (1); > commit; > > -- now this would find second insert, but not the first one > SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0'); It'd be good to make this into an isolation test. > What we can do here? I think the easiest fix might actually be to just ignore serialized snapshots when building the initial snapshot. Greetings, Andres Freund
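The serialized snapshots under discussion are ordinary files under pg_logical/snapshots in the data directory, written at xl_running_xacts boundaries. A rough, illustrative way to watch one appear while the concurrent decoding session runs (pg_ls_dir requires superuser on these versions):

-- Each file is named after the LSN of the xl_running_xacts record it was
-- serialized at; a new .snap file appearing right after the slot_1
-- get_changes call marks the window in which slot_2 creation can pick it up.
SELECT pg_ls_dir('pg_logical/snapshots');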
Andres Freund <andres@anarazel.de> writes: > Hi, > > On 2019-10-24 12:59:30 +0300, Arseny Sher wrote: >> Our customer also encountered this issue and I've looked into it. The problem is >> reproduced well enough using the instructions in the previous message. > > Is this with > > commit bac2fae05c7737530a6fe8276cd27d210d25c6ac > Author: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: 2019-09-13 16:36:28 -0300 > > logical decoding: process ASSIGNMENT during snapshot build > > Most WAL records are ignored in early SnapBuild snapshot build phases. > But it's critical to process some of them, so that later messages have > the correct transaction state after the snapshot is completely built; in > particular, XLOG_XACT_ASSIGNMENT messages are critical in order for > sub-transactions to be correctly assigned to their parent transactions, > or at least one assert misbehaves, as reported by Ildar Musin. > > Diagnosed-by: Masahiko Sawada > Author: Masahiko Sawada > Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com > > applied? Yeah, I tried fresh master. See below: skipped xl_xact_assignment is beyond restart_lsn at all (and thus not read), so this doesn't matter. >> The check leading to this ERROR is too strict, it forbids legit behaviours. Say >> we have in WAL >> >> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ] >> >> - First xl_xact_assignment record is beyond reading, i.e. earlier >> restart_lsn, where ready snapshot will be taken from disk. >> - After restart_lsn there is some change of a subxact. >> - After that, there is second xl_xact_assignment (for another subxact) >> revealing relationship between top and first subxact, where this ERROR fires. >> >> Such transaction won't be streamed because we hadn't seen it in full. It must be >> finished before streaming will start, i.e. before confirmed_flush_lsn. >> >> Of course, the easiest fix is to just throw the check out. > > I don't think that'd actually be a fix, and just hiding a bug. I don't see a bug here. At least in reproduced scenario I see false alert, as explained above: transaction with skipped xl_xact_assignment won't be streamed as it finishes before confirmed_flush_lsn. And I am pretty sure people encountered in the field the same issue. In the end of my mail I have offered a way to relax this check instead of removing it to avoid false triggers: serialize/deserialize a snapshot only at xl_running_xacts to know nextXid, add function to snapbuilder to check whether xact can be streamed or not by looking at its xid, etc, somehow deal with existing serialized snaps which may be logged at END_OF_RECOVERY without nextXid. I don't believe this check is worth these complexities. If you think it does, I can do that though. > >> Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the >> snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not >> just complete snapshot (snapshot by itself in FULL state is just good as in >> CONSISTENT), but also reorderbuffer filled with all currently running >> xacts. This is painless for decoding sessions with existing slots because they >> won't stream anything before confirmed_flush_lsn is reached anyway, at which >> point all transactions which hadn't got into reorderbuffer would definitely >> finish. 
However, new slots might be created too early, thus losing (not >> decoding) parts of transactions committed after freshly created >> confirmed_flush_lsn. This can happen under the following extremely unlucky >> circumstances: >> - New slot creation reserves point in WAL since which it would read it >> (GetXLogInsertRecPtr); >> - It logs xl_running_xacts to start assembling a snapshot; >> - Running decoding session with another slot quickly reads this >> xl_running_xacts and serializes its snapshot; >> - New slot reads xl_running_xacts and picks this snapshot up, saying that it >> is ready to stream henceforth, though its reorderbuffer is empty. > > Yea, that's a problem :( > > >> Exact reproducing steps: >> >> -- session 1 >> create table t (i int); >> select pg_create_logical_replication_slot('slot_1', 'test_decoding'); >> >> -- session 2 >> begin; >> insert into t values (1); >> >> -- session 3, start slot creation >> select pg_create_logical_replication_slot('slot_2', 'test_decoding'); >> -- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx); >> >> -- session 1 >> -- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no >> -- need to sleep; our snap will be immediately serialized there >> SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0'); >> >> -- continue slot_2 creation >> >> -- session 2: insert some more and commit >> insert into t values (1); >> commit; >> >> -- now this would find second insert, but not the first one >> SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0'); > > It'd be good to make this into an isolation test. Yeah, but to get real chance of firing this requires kinda sleep/break in the middle of pg_create_logical_replication_slot execution, so I have no idea how to do that =( > >> What we can do here? > > I think the easiest fix might actually be to just ignore serialized > snapshots when building the initial snapshot. That's an option. However, that anyway requires the distinction between new slot creation and streaming from existing slot at snapbuilder level, which currently doesn't exist and which constitutes most part of my first patch. If we add that, changing between using and not using serialized snapshots in new slot creation is easy (my patch uses it), I think this is not principal. -- Arseny Sher
Andres, Álvaro, could you please have a look at this? -- Arseny Sher
Is the resolution of the issue in this thread being tracked elsewhere, either in a commit fest or other stream of work?
On Tue, Dec 17, 2019 at 9:47 AM Arseny Sher <a.sher@postgrespro.ru> wrote:
Andres, Álvaro, could you please have a look at this?
--
Arseny Sher
Dan Katz <dkatz@joor.com> writes:

> Is the resolution of the issue in this thread being tracked elsewhere,
> either in a commit fest or other stream of work?

Ok, I've created a cf entry:
https://commitfest.postgresql.org/26/2383/

I believe the most important thing is to commit at least
0002-Stop-demanding-that-top-xact-must-be-seen-before-sub.patch
from my mail above -- as we see, this issue creates problems in the field.
Moreover, the patch is trivial, and hopefully I've shown that the ERROR
triggers were spurious and there is no easy way to relax the check.

-- Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Arseny,
I was hoping you could give me some insight into how this bug might appear with multiple replication slots. For example, if I have two replication slots, would you expect both slots to see the same error, even if they were started, consumed, or confirmed-flushed at different times?
Dan
On Tue, Dec 17, 2019 at 10:15 AM Arseny Sher <a.sher@postgrespro.ru> wrote:
Dan Katz <dkatz@joor.com> writes:
> Is the resolution of the issue in this thread being tracked elsewhere,
> either in a commit fest or other stream of work?
Ok, I've created a cf entry:
https://commitfest.postgresql.org/26/2383/
I believe it is the most important to commit at least
0002-Stop-demanding-that-top-xact-must-be-seen-before-sub.patch
from my mail above -- as we see, this issue creates problems in the
field. Moreover, the patch is trivial and hopefully I've shown that
ERROR triggers were spurious and there is no easy way to relax the
check.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,

Dan Katz <dkatz@joor.com> writes:

> Arseny,
>
> I was hoping you could give me some insight into how this bug might
> appear with multiple replication slots. For example, if I have two
> replication slots, would you expect both slots to see the same error,
> even if they were started, consumed, or confirmed-flushed at different
> times?

Well, to encounter this you must happen to interrupt a decoding session
(e.g. shut down the server) when restart_lsn (the LSN from which WAL will be
read next time) is at an unfortunate position, as described in
https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad

Generally each slot has its own restart_lsn, so if one decoding session gets
stuck on this issue, another one won't necessarily fail at the same time.
However, restart_lsn can be advanced only to certain points, mainly
xl_running_xacts records, which are logged every 15 seconds. So if all
consumers acknowledge changes fast enough, it is quite likely that during
shutdown restart_lsn will be the same for all slots -- which means either all
of them will get stuck on further decoding or none of them will. If not,
different slots might have different restart_lsn values and probably won't
fail at the same time; but encountering this issue even once suggests that
your workload makes such a problematic restart_lsn noticeably likely
(i.e. many subtransactions). And each restart_lsn probably has approximately
the same chance of being 'bad' (provided the workload is even).

We need a committer familiar with this code to look here...

-- Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
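One rough way to gauge how correlated the slots are is to compare their positions; slots that report the same restart_lsn will re-read the same stretch of WAL after a restart and so tend to hit (or avoid) the bad position together. An illustrative query:

SELECT slot_name, restart_lsn, confirmed_flush_lsn, active
FROM pg_replication_slots
ORDER BY restart_lsn;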
On Fri, Oct 25, 2019 at 12:26 PM Arseny Sher <a.sher@postgrespro.ru> wrote: > > > Andres Freund <andres@anarazel.de> writes: > > > Hi, > > > > On 2019-10-24 12:59:30 +0300, Arseny Sher wrote: > >> Our customer also encountered this issue and I've looked into it. The problem is > >> reproduced well enough using the instructions in the previous message. > > > > Is this with > > > > commit bac2fae05c7737530a6fe8276cd27d210d25c6ac > > Author: Alvaro Herrera <alvherre@alvh.no-ip.org> > > Date: 2019-09-13 16:36:28 -0300 > > > > logical decoding: process ASSIGNMENT during snapshot build > > > > Most WAL records are ignored in early SnapBuild snapshot build phases. > > But it's critical to process some of them, so that later messages have > > the correct transaction state after the snapshot is completely built; in > > particular, XLOG_XACT_ASSIGNMENT messages are critical in order for > > sub-transactions to be correctly assigned to their parent transactions, > > or at least one assert misbehaves, as reported by Ildar Musin. > > > > Diagnosed-by: Masahiko Sawada > > Author: Masahiko Sawada > > Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com > > > > applied? > > Yeah, I tried fresh master. See below: skipped xl_xact_assignment is > beyond restart_lsn at all (and thus not read), so this doesn't matter. > > > >> The check leading to this ERROR is too strict, it forbids legit behaviours. Say > >> we have in WAL > >> > >> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ] > >> > >> - First xl_xact_assignment record is beyond reading, i.e. earlier > >> restart_lsn, where ready snapshot will be taken from disk. > >> - After restart_lsn there is some change of a subxact. > >> - After that, there is second xl_xact_assignment (for another subxact) > >> revealing relationship between top and first subxact, where this ERROR fires. > >> > >> Such transaction won't be streamed because we hadn't seen it in full. It must be > >> finished before streaming will start, i.e. before confirmed_flush_lsn. > >> > >> Of course, the easiest fix is to just throw the check out. > > > > I don't think that'd actually be a fix, and just hiding a bug. > > I don't see a bug here. At least in reproduced scenario I see false > alert, as explained above: transaction with skipped xl_xact_assignment > won't be streamed as it finishes before confirmed_flush_lsn. > Does this guarantee come from the fact that we need to wait for such a transaction before reaching a consistent snapshot state? If not, can you explain a bit more what makes you say so? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:

>> I don't see a bug here. At least in reproduced scenario I see false
>> alert, as explained above: transaction with skipped xl_xact_assignment
>> won't be streamed as it finishes before confirmed_flush_lsn.
>
> Does this guarantee come from the fact that we need to wait for such a
> transaction before reaching a consistent snapshot state? If not, can
> you explain a bit more what makes you say so?

Right, see the FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition -- it exists
exactly for this purpose: once we have a good snapshot, we need to wait for
all running xacts to finish in order to see in full all xacts we are
promising to stream. This ensures the <restart_lsn, confirmed_flush_lsn> pair
is good initially (reading WAL from the former is enough to see in full all
xacts committing after the latter), and the slot advancement arrangements
ensure it stays good forever (see LogicalIncreaseRestartDecodingForSlot).

Well, almost. This is true as long as the initial snapshot construction
process goes the long way of building the snapshot by itself. If it happens
to pick up from disk a ready snapshot pickled there by another decoding
session, it fast-paths to SNAPBUILD_CONSISTENT, which is technically a bug,
as described in
https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad

In theory, this bug could indeed lead to the 'subtransaction logged without
previous top-level txn record' error. In practice, I think its probability is
vanishingly small -- slot creation must be interleaved, within a very short
window, with another decoder serializing its snapshot (see the exact sequence
of steps in the mail above). What is much more probable (it doesn't involve
new slot creation and is relatively easily reproducible without sleeps) is a
false alert triggered by an unlucky position of restart_lsn.

Surely we still must fix it. I just mean:
- People definitely encountered the false alert, not this bug (at least
  because nobody said it happened immediately after slot creation).
- I've no bright ideas how to relax the check to make it proper without
  additional complications, and I'm pretty sure this is impossible (again,
  see above for details), so I'd remove it.

-- Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
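The "wait for all running xacts to finish" part is easy to observe from SQL; a small sketch, with made-up session labels and slot name:

-- session A: open a transaction and force an xid to be assigned
BEGIN;
SELECT txid_current();

-- session B: blocks inside DecodingContextFindStartpoint until session A
-- finishes, because the snapshot builder must see every xact running at the
-- initial xl_running_xacts complete before declaring the snapshot consistent
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- session A: committing (or aborting) lets session B's call return
COMMIT;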
On Mon, Feb 3, 2020 at 2:50 PM Arseny Sher <a.sher@postgrespro.ru> wrote: > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > >> I don't see a bug here. At least in reproduced scenario I see false > >> alert, as explained above: transaction with skipped xl_xact_assignment > >> won't be streamed as it finishes before confirmed_flush_lsn. > >> > > > > Does this guarantee come from the fact that we need to wait for such a > > transaction before reaching a consistent snapshot state? If not, can > > you explain a bit more what makes you say so? > > Right, see FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition -- it exists > exactly for this purpose: once we have good snapshot, we need to wait > for all running xacts to finish to see all xacts we are promising to > stream in full. > So, doesn't this mean that it started occurring after the fix done in commit 96b5033e11 [1]? Because before that fix we wouldn't have allowed processing XLOG_XACT_ASSIGNMENT records unless we are in SNAPBUILD_FULL_SNAPSHOT state. I am not telling the fix in that commit is wrong, but just trying to understand the situation here. > > Well, almost. This is true as long initial snapshot construction process > goes the long way of building the snapshot by itself. If it happens to > pick up from disk ready snapshot pickled there by another decoding > session, it fast path'es to SNAPBUILD_CONSISTENT, which is technically a > bug as described in > https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad > Can't we deal with this separately? If so, I think let's not mix the discussions for both as the root cause of both seems different. [1] - commit bac2fae05c7737530a6fe8276cd27d210d25c6ac Author: Alvaro Herrera <alvherre@alvh.no-ip.org> Date: 2019-09-13 16:36:28 -0300 logical decoding: process ASSIGNMENT during snapshot build -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:

> So, doesn't this mean that it started occurring after the fix done in
> commit 96b5033e11 [1]? Because before that fix we wouldn't have
> allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
> SNAPBUILD_FULL_SNAPSHOT state. I am not telling the fix in that
> commit is wrong, but just trying to understand the situation here.

Nope. Consider again the example of WAL above triggering the error:

[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]

The decoder starts reading WAL at <restart_lsn>, where it immediately reads
from disk the snapshot serialized earlier, which makes it jump to
SNAPBUILD_CONSISTENT right away. It doesn't read xl_xact_assignment_1, but it
reads xl_xact_assignment_2 already in SNAPBUILD_CONSISTENT state, so it
catches the error regardless of this commit.

>> Well, almost. This is true as long as the initial snapshot construction
>> process goes the long way of building the snapshot by itself. If it
>> happens to pick up from disk a ready snapshot pickled there by another
>> decoding session, it fast-paths to SNAPBUILD_CONSISTENT, which is
>> technically a bug, as described in
>> https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad
>
> Can't we deal with this separately? If so, I think let's not mix the
> discussions for both as the root cause of both seems different.

These issues are related: before removing the check it would be nice to
ensure that there are no bugs it might protect us from (and it turns out
there actually is one, though it won't always protect, and though this bug
has a very small probability). Moreover, they are about more or less the same
subject -- avoiding partially decoded xacts -- and once you have dived deep
enough to deal with one, it is reasonable to deal with the other instead of
doing that twice.

But as a practical matter, removing the check is a simple one-liner, and its
presence causes people trouble -- so I'd suggest doing that first and then
dealing with the rest. I don't think starting a new thread is worthwhile
here, but if you think it is, I can create one.

-- Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <a.sher@postgrespro.ru> wrote: > Amit Kapila <amit.kapila16@gmail.com> writes: > > > So, doesn't this mean that it started occurring after the fix done in > > commit 96b5033e11 [1]? Because before that fix we wouldn't have > > allowed processing XLOG_XACT_ASSIGNMENT records unless we are in > > SNAPBUILD_FULL_SNAPSHOT state. I am not telling the fix in that > > commit is wrong, but just trying to understand the situation here. > > Nope. Consider again example of WAL above triggering the error: > > [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ] > > Decoder starting reading WAL at <restart_lsn> where he immediately reads > from disk snapshot serialized earlier, which makes it jump to > SNAPBUILD_CONSISTENT right away. > Sure, I understand that if we get serialized snapshot from disk, this problem can occur and probably we can fix by ignoring serialized snapshots during slot creation or something on those lines. However, what I am trying to understand is whether this can occur from another path as well. I think it might occur without using serialized snapshots as well because we allow decoding xl_xact_assignment record even when the snapshot state is not full. Say in your above example, even if the snapshot state is not SNAPBUILD_CONSISTENT as we haven't used the serialized snapshot, then also, it can lead to the above problem due to decoding of xl_xact_assignment. I have not tried to generate a test case for this, so I could easily be wrong here. What I am trying to get at is if the problem can only occur by using serialized snapshots and we fix it by somehow not using them initial slot creation, then ideally we don't need to remove the error in ReorderBufferAssignChild, but if that is not the case, then we need to discuss other cases as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:

> On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>> Amit Kapila <amit.kapila16@gmail.com> writes:
>>
>> > So, doesn't this mean that it started occurring after the fix done in
>> > commit 96b5033e11 [1]? Because before that fix we wouldn't have
>> > allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
>> > SNAPBUILD_FULL_SNAPSHOT state. I am not telling the fix in that
>> > commit is wrong, but just trying to understand the situation here.
>>
>> Nope. Consider again the example of WAL above triggering the error:
>>
>> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
>>
>> The decoder starts reading WAL at <restart_lsn>, where it immediately
>> reads from disk the snapshot serialized earlier, which makes it jump to
>> SNAPBUILD_CONSISTENT right away.
>
> Sure, I understand that if we get serialized snapshot from disk, this
> problem can occur and probably we can fix by ignoring serialized
> snapshots during slot creation or something on those lines.

There is some confusion. I'll try to reword what we have here:

1) Decoding from an existing slot (*not* initial snapshot construction)
starts up, immediately picks up the snapshot at restart_lsn (getting into
SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns about a
transaction which it hadn't seen in full (no top-level records) and which it
is not even going to stream -- but it still dies with "subtransaction logged
without...". That's my example above, and that's what people are complaining
about. Here, usage of the serialized snapshot and the jump to
SNAPBUILD_CONSISTENT is not just legit, it is essential: in order to be able
to stream data since confirmed_flush_lsn, we must pick the snapshot up, as we
might not be able to assemble it from scratch in time. I mean, what is wrong
here is not the serialized snapshot usage but the check.

(The lengthy comment to AllocateSnapshotBuilder in my
0001-Fix-serialized-snapshot-usage-for-new-logical-slots.patch explains why
the snapbuilder is not able to do the FULL -> CONSISTENT transition on its
own early enough for decoding from an existing slot, so the jump on snapshot
pickup is performed to CONSISTENT directly.)

This happens with or without bac2fae05c.

2) A *new* slot creation process picks up a serialized snapshot and jumps to
CONSISTENT without waiting for all running xacts to finish. This is wrong and
is a bug (of very low probability), as we risk promising to decode xacts
which we might not have seen in full. Sometimes it could be caught by the
"subtransaction logged without..." check, but not necessarily -- e.g. there
could be no subxacts at all.

> However,
> what I am trying to understand is whether this can occur from another
> path as well. I think it might occur without using serialized
> snapshots as well because we allow decoding xl_xact_assignment record
> even when the snapshot state is not full. Say in your above example,
> even if the snapshot state is not SNAPBUILD_CONSISTENT as we haven't
> used the serialized snapshot, then also, it can lead to the above
> problem due to decoding of xl_xact_assignment. I have not tried to
> generate a test case for this, so I could easily be wrong here.

What you are suggesting here is 3), which is, well, sort of a form of 1),
meaning the "subxact logged..." error is also pointlessly triggered, but for
new slot creation. With bac2fae0, the decoder might miss the first
xl_xact_assignment because it is before the start of reading WAL, but
encounter the second xl_xact_assignment and die on it due to this check
before even getting to FULL.

But now that I'm thinking about it, I suspect that something similar could
happen even before bac2fae0. Imagine

<start_of_reading_wal> <xl_xact_assignment_1> <SNAPBUILD_FULL> <subxact_change> <xl_xact_assignment_2> <commit> ... <SNAPBUILD_CONSISTENT>

Before bac2fae0, xl_xact_assignment_1 was ignored, so xl_xact_assignment_1
would trigger the error.

> What I am trying to get at is if the problem can only occur by using
> serialized snapshots and we fix it by somehow not using them initial
> slot creation, then ideally we don't need to remove the error in
> ReorderBufferAssignChild, but if that is not the case, then we need to
> discuss other cases as well.

So, 1) and 3) mean this is not the case.

-- Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Feb 5, 2020 at 2:34 PM Arseny Sher <a.sher@postgrespro.ru> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <a.sher@postgrespro.ru> wrote: > >> Amit Kapila <amit.kapila16@gmail.com> writes: > >> > >> > So, doesn't this mean that it started occurring after the fix done in > >> > commit 96b5033e11 [1]? Because before that fix we wouldn't have > >> > allowed processing XLOG_XACT_ASSIGNMENT records unless we are in > >> > SNAPBUILD_FULL_SNAPSHOT state. I am not telling the fix in that > >> > commit is wrong, but just trying to understand the situation here. > >> > >> Nope. Consider again example of WAL above triggering the error: > >> > >> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ] > >> > >> Decoder starting reading WAL at <restart_lsn> where he immediately reads > >> from disk snapshot serialized earlier, which makes it jump to > >> SNAPBUILD_CONSISTENT right away. > >> > > > > Sure, I understand that if we get serialized snapshot from disk, this > > problem can occur and probably we can fix by ignoring serialized > > snapshots during slot creation or something on those lines. > > There is some confusion. I'll try to reword what we have here: > > 1) Decoding from existing slot (*not* initial snapshot construction) > starts up, immediately picks up snapshot at restart_lsn (getting into > SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it > hadn't seen in full (no toplevel records) transaction which it is not > even going to stream -- but still dies with "subtransation logged > without...". That's my example above, and that's what people are > complaining about. Here, usage of serialized snapshot and jump to > SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be > able to stream data since confirmed_flush_lsn, we must pick it up as we > might not be able to assemble it from scratch in time. I mean, what is > wrong here is not serialized snapshot usage but the check. > I was thinking if we have some way to skip processing of xl_xact_assignment for such cases, then it might be better. Say, along with restart_lsn, if have some way to find corresponding nextXid (below which we don't need to process records). Say, if, during SnapBuildProcessRunningXacts, we record the xid of txn we got by ReorderBufferGetOldestTXN in slot, then can't we use it to skip such records. > (Lengthy comment to AllocateSnapshotBuilder in my > 0001-Fix-serialized-snapshot-usage-for-new-logical-slots.patch explains > why snapbuilder is not able to do FULL -> CONSISTENT transition on its > own early enough for decoding from existing slot, so the jump on > snapshot pickup is performed to CONSISTENT directly.) > > This happens with or without bac2fae05c. > > 2) *New* slot creationg process picks up serialized snapshot and jumps > to CONSISTENT without waiting for all running xacts to finish. This is > wrong and is a bug (of very low probability), as we risk promising to > decode xacts which we might not seen in full. Sometimes it could be > arrested by "subtransation logged without..." check, but not necessarily > -- e.g. there could be no subxacts at all. > > > > However, > > what I am trying to understand is whether this can occur from another > > path as well. I think it might occur without using serialized > > snapshots as well because we allow decoding xl_xact_assignment record > > even when the snapshot state is not full. 
Say in your above example, > > even if the snapshot state is not SNAPBUILD_CONSISTENT as we haven't > > used the serialized snapshot, then also, it can lead to the above > > problem due to decoding of xl_xact_assignment. I have not tried to > > generate a test case for this, so I could easily be wrong here. > > What you are suggesting here is 3), which is, well, sort of form of 1), > meaning "subxact logged..." error also pointlessly triggered, but for > new slot creation. With bac2fae0, decoder might miss first > xl_xact_assignment because it is beyond start of reading WAL but > encounter second xl_xact_assignment and die on it due to this check > before even getting FULL. > > But now that I'm thinking about it, I suspect that similar could happen > even before bac2fae0. Imagine > > <start_of_reading_wal> <xl_xact_assignment_1> <SNAPBUILD_FULL> <subxact_change> <xl_xact_assignment_2> <commit> ... <SNAPBUILD_CONSISTENT> > > Before bac2fae0, xl_xact_assignment_1 was ignored, so > xl_xact_assignment_1 would trigger the error. > 'xl_xact_assignment_1 would trigger the error', I think in this part of sentence you mean to say xl_xact_assignment_2 because we won't try to decode xl_xact_assignment_1 before bac2fae0. If so, won't we wait for such a transaction to finish while changing the snapshot state from SNAPBUILD_BUILDING_SNAPSHOT to SNAPBUILD_FULL_SNAPSHOT? And if the transaction is finished, ideally, we should not try to decode its WAL and or create its ReorderBufferTxn. > > What I am trying to get at is if the problem can only occur by using > > serialized snapshots and we fix it by somehow not using them initial > > slot creation, then ideally we don't need to remove the error in > > ReorderBufferAssignChild, but if that is not the case, then we need to > > discuss other cases as well. > > So, 1) and 3) mean this is not the case. > Right, I am thinking that if we can find some way to skip the xact assignment for (1) and (3), then that might be a better fix. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
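The idea Amit floats above (recording, alongside restart_lsn, the xid returned by ReorderBufferGetOldestTXN and skipping older assignment records) might look roughly like the sketch below. This is an illustration only: skip_assignments_before and SnapBuildSkipAssignmentsBefore are invented names that do not exist in SnapBuild or the slot, SnapBuild is in reality private to snapbuild.c so decode.c would need an accessor, and Arseny explains below why this is not sufficient without persisting extra data.

/* In SnapBuildProcessRunningXacts(): remember the oldest xid that still has
 * changes queued in the reorderbuffer when this restart point is set up.
 * (skip_assignments_before is a hypothetical field.) */
ReorderBufferTXN *oldest = ReorderBufferGetOldestTXN(builder->reorder);

builder->skip_assignments_before =
    (oldest != NULL) ? oldest->xid : running->nextXid;

/* In DecodeXactOp(), XLOG_XACT_ASSIGNMENT case: skip_xid would come from the
 * builder via a hypothetical accessor. Ignore assignments whose top-level xid
 * predates that point; such transactions were not seen in full and would not
 * be streamed anyway. */
TransactionId skip_xid = SnapBuildSkipAssignmentsBefore(ctx->snapshot_builder);

if (TransactionIdIsValid(skip_xid) &&
    TransactionIdPrecedes(xlrec->xtop, skip_xid))
    return;     /* too old to matter for this slot */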
On Fri, Feb 7, 2020 at 4:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 5, 2020 at 2:34 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> >
> > > What I am trying to get at is if the problem can only occur by using
> > > serialized snapshots and we fix it by somehow not using them initial
> > > slot creation, then ideally we don't need to remove the error in
> > > ReorderBufferAssignChild, but if that is not the case, then we need to
> > > discuss other cases as well.
> >
> > So, 1) and 3) mean this is not the case.
> >
>
> Right, I am thinking that if we can find some way to skip the xact
> assignment for (1) and (3), then that might be a better fix.
>

Just to be clear, I am only brainstorming ideas to see if we can find a better solution to the problems (1) and (3) described by Arseny in the above email [1]. At this stage, it is not clear to me that we have a fix simple enough to backpatch apart from what Arseny posted in his first email [2] (which is to stop demanding that the top xact must be seen before a subxact in decoding).

[1] - https://www.postgresql.org/message-id/87zhdx76d5.fsf%40ars-thinkpad
[2] - https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
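The check that patch [2] would remove sits at the top of ReorderBufferAssignChild() in reorderbuffer.c. An annotated excerpt follows; the code matches the hunks quoted later in the thread, while the comments are added here as explanation.

/* Excerpt from ReorderBufferAssignChild(); annotations added for this thread. */

/*
 * Look up the top-level transaction, creating its ReorderBufferTXN if it does
 * not exist yet; new_top is set when nothing about xid had been seen before
 * this assignment record.
 */
txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);

/*
 * Same for the subtransaction (created as a non-top-level entry); new_sub is
 * set when this is the first time subxid shows up.
 */
subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);

/*
 * The check under discussion: the subxact already has buffered state (new_sub
 * is false) while the top-level xact is brand new, which is exactly what
 * happens when decoding starts between the transaction's first records and a
 * later xl_xact_assignment.
 */
if (new_top && !new_sub)
    elog(ERROR, "subtransaction logged without previous top-level txn record");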
Amit Kapila <amit.kapila16@gmail.com> writes: >> 1) Decoding from existing slot (*not* initial snapshot construction) >> starts up, immediately picks up snapshot at restart_lsn (getting into >> SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it >> hadn't seen in full (no toplevel records) transaction which it is not >> even going to stream -- but still dies with "subtransation logged >> without...". That's my example above, and that's what people are >> complaining about. Here, usage of serialized snapshot and jump to >> SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be >> able to stream data since confirmed_flush_lsn, we must pick it up as we >> might not be able to assemble it from scratch in time. I mean, what is >> wrong here is not serialized snapshot usage but the check. >> > > I was thinking if we have some way to skip processing of > xl_xact_assignment for such cases, then it might be better. Say, > along with restart_lsn, if have some way to find corresponding nextXid > (below which we don't need to process records). I don't believe you can that without persisting additional data. Basically, what we need is list of transactions who are running at the point of snapshot serialization *and* already wrote something before it -- those we hadn't seen in full and can't decode. We have no such data currently. The closest thing we have is xl_running_xacts->nextXid, but 1) issued xid doesn't necessarily means xact actually wrote something, so we can't just skip xl_xact_assignment for xid < nextXid, it might still be decoded 2) snapshot might be serialized not at xl_running_xacts anyway Surely this thing doesn't deserve changing persisted data format. Somehow I hadn't realized this earlier, so my comments/commit messages in patches above were not accurate here; I've edited them. Also in the first patch serialized snapshots are not no longer used for new slot creation at all, as Andres suggested above. This is not principal, as I said, but arguably makes things simpler a bit. I've also found a couple of issues with slot copying feature, will post in separate thread on them. From 8952bc955585d12a06cdd6d339bcbabafc772768 Mon Sep 17 00:00:00 2001 From: Arseny Sher <sher-ars@yandex.ru> Date: Tue, 22 Oct 2019 19:02:14 +0300 Subject: [PATCH 1/2] Don't use serialized snapshots during logical slot creation. snapbuild.c enters SNAPBUILD_CONSISTENT immediately after deserializing the snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not just complete snapshot, but also reorderbuffer filled with all currently running xacts. This is fine for decoding sessions with existing slots though as they won't stream anything before confirmed_flush_lsn is reached anyway, at which point all transactions which hadn't got into reorderbuffer would definitely finish. However, new slots might be created too early, thus losing (not decoding) parts of transactions committed after freshly created confirmed_flush_lsn. This can happen under the following extremely unlucky circumstances: - New slot creation reserves point in WAL since which it would read it (GetXLogInsertRecPtr); - It logs xl_running_xacts to start assembling a snapshot; - Running decoding session with another slot quickly reads this xl_running_xacts and serializes its snapshot; - New slot reads xl_running_xacts and picks this snapshot up, saying that it is ready to stream henceforth, though its reorderbuffer is empty. Fix by forbidding to use serialized snapshots during slot creation. 
Actually we could use snapshot if it was picked at xl_running_xacts where we could perform usual FULL -> CONSISTENT transition conservatively via nextXid, but that seems like a complication for a too small benefit. --- src/backend/replication/logical/logical.c | 2 +- src/backend/replication/logical/snapbuild.c | 85 +++++++++++++++++++++++------ 2 files changed, 70 insertions(+), 17 deletions(-) diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c index da265f5294..67ed52e56f 100644 --- a/src/backend/replication/logical/logical.c +++ b/src/backend/replication/logical/logical.c @@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin, ReplicationSlotMarkDirty(); ReplicationSlotSave(); - ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon, + ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon, need_full_snapshot, false, read_page, prepare_write, do_write, update_progress); diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c index 0bd1d0f954..6f503a93aa 100644 --- a/src/backend/replication/logical/snapbuild.c +++ b/src/backend/replication/logical/snapbuild.c @@ -165,7 +165,7 @@ struct SnapBuild /* * Don't replay commits from an LSN < this LSN. This can be set externally - * but it will also be advanced (never retreat) from within snapbuild.c. + * or established by snapbuild.c once consistent snapshot is assembled. */ XLogRecPtr start_decoding_at; @@ -309,7 +309,16 @@ SnapBuildStartNextPhaseAt(SnapBuild *builder, TransactionId at) * Allocate a new snapshot builder. * * xmin_horizon is the xid >= which we can be sure no catalog rows have been - * removed, start_lsn is the LSN >= we want to replay commits. + * removed. + * start_lsn is InvalidXLogRecPtr or LSN >= we want to replay commits. + * InvalidXLogRecPtr is used during slot creation; snapbuild will assemble + * consistent snapshot and (if continue decoding -- no core code does that + * currently) stream all further commits. + * If normal LSN is passed, caller *must* be sure that WAL reading starts + * early enough to get the snapshot and pick up the first change of + * earliest xact to stream. Slot creation and advancement machinery + * guarantees that slot's <restart_lsn, confirmed_flush_lsn> pair always + * satisfies this. 
*/ SnapBuild * AllocateSnapshotBuilder(ReorderBuffer *reorder, @@ -408,7 +417,8 @@ SnapBuildCurrentState(SnapBuild *builder) bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr) { - return ptr < builder->start_decoding_at; + return XLogRecPtrIsInvalid(builder->start_decoding_at) || + ptr < builder->start_decoding_at; } /* @@ -945,16 +955,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder)))) { /* ensure that only commits after this are getting replayed */ - if (builder->start_decoding_at <= lsn) - builder->start_decoding_at = lsn + 1; + Assert(XLogRecPtrIsInvalid(builder->start_decoding_at) || + builder->start_decoding_at > lsn); return; } if (builder->state < SNAPBUILD_CONSISTENT) { /* ensure that only commits after this are getting replayed */ - if (builder->start_decoding_at <= lsn) - builder->start_decoding_at = lsn + 1; + Assert(XLogRecPtrIsInvalid(builder->start_decoding_at) || + builder->start_decoding_at > lsn); /* * If building an exportable snapshot, force xid to be tracked, even @@ -966,6 +976,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, } } + if (!XLogRecPtrIsInvalid(builder->start_decoding_at) && + builder->start_decoding_at <= lsn) + { + /* + * We are going to stream this xact, so must already have fine + * snapshot. + */ + Assert(builder->state == SNAPBUILD_CONSISTENT); + } + for (nxact = 0; nxact < nsubxacts; nxact++) { TransactionId subxid = subxacts[nxact]; @@ -1250,10 +1270,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn */ if (running->oldestRunningXid == running->nextXid) { - if (builder->start_decoding_at == InvalidXLogRecPtr || - builder->start_decoding_at <= lsn) + if (XLogRecPtrIsInvalid(builder->start_decoding_at)) /* can decode everything after this */ builder->start_decoding_at = lsn + 1; + Assert(builder->start_decoding_at >= lsn); /* As no transactions were running xmin/xmax can be trivially set. */ builder->xmin = running->nextXid; /* < are finished */ @@ -1277,7 +1297,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn else if (!builder->building_full_snapshot && SnapBuildRestore(builder, lsn)) { - /* there won't be any state to cleanup */ + /* there won't be much state to cleanup */ return false; } @@ -1358,6 +1378,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn { builder->state = SNAPBUILD_CONSISTENT; SnapBuildStartNextPhaseAt(builder, InvalidTransactionId); + if (XLogRecPtrIsInvalid(builder->start_decoding_at)) + /* can decode everything after this */ + builder->start_decoding_at = lsn + 1; + Assert(builder->start_decoding_at >= lsn); ereport(LOG, (errmsg("logical decoding found consistent point at %X/%X", @@ -1499,8 +1523,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn) builder->last_serialized_snapshot <= lsn); /* - * no point in serializing if we cannot continue to work immediately after - * restoring the snapshot + * No point in serializing if the snapshot is not complete (< FULL). + * FULL snapshot per se is just as good as CONSISTENT; difference between + * these states is not a snapshot property, but whether we have filled + * reorderbuffer with all currently running xacts. However, there isn't + * much sense in serializing it as serialized snaps are used only for + * decoding from existing slots, which had already reached CONSISTENT. 
*/ if (builder->state < SNAPBUILD_CONSISTENT) return; @@ -1688,7 +1716,8 @@ out: /* * Restore a snapshot into 'builder' if previously one has been stored at the - * location indicated by 'lsn'. Returns true if successful, false otherwise. + * location indicated by 'lsn'. + * Returns true if successful, false otherwise. */ static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn) @@ -1704,6 +1733,16 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn) if (builder->state == SNAPBUILD_CONSISTENT) return false; + /* + * Don't use serialized snapshot if we are not sure where all + * currently running xacts will finish (new slot creation). + * (Actually, if we came here through xl_running_xacts, we could perform + * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly, + * but added lines of code would hardly worth the benefit.) + */ + if (builder->start_decoding_at == InvalidXLogRecPtr) + return false; + sprintf(path, "pg_logical/snapshots/%X-%X.snap", (uint32) (lsn >> 32), (uint32) lsn); @@ -1888,7 +1927,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn) /* ok, we think the snapshot is sensible, copy over everything important */ builder->xmin = ondisk.builder.xmin; builder->xmax = ondisk.builder.xmax; - builder->state = ondisk.builder.state; builder->committed.xcnt = ondisk.builder.committed.xcnt; /* We only allocated/stored xcnt, not xcnt_space xids ! */ @@ -1901,6 +1939,23 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn) } ondisk.builder.committed.xip = NULL; + /* + * We probably don't have in reorderbuffer parts of currently running + * xacts right now. However, snapbuild can't determine *earliest* spot + * where decoding is safe (we'd see all xacts in full) on its own, because + * this requires to know which xids already wrote something, and such + * writes are beyond our reading of WAL. xl_running_xacts->nextXid + * doesn't help us here, because 1) issued xid doesn't necessarily means + * xact wrote something, so there would be a risk of arriving at + * CONSISTENT a bit later than possible which is ok for new slot creation, + * but unacceptable otherwise; 2) snapshot might be serialized at some + * other point. + * + * Slot must have provided as with safe start_decoding_at though, so just + * skip FULL -> CONSISTENT transition and set CONSISTENT directly. + */ + builder->state = SNAPBUILD_CONSISTENT; + /* our snapshot is not interesting anymore, build a new one */ if (builder->snapshot != NULL) { @@ -1911,8 +1966,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn) ReorderBufferSetRestartPoint(builder->reorder, lsn); - Assert(builder->state == SNAPBUILD_CONSISTENT); - ereport(LOG, (errmsg("logical decoding found consistent point at %X/%X", (uint32) (lsn >> 32), (uint32) lsn), -- 2.11.0 From 1449c8aacd2feadfa104a1ea4b73030e37cbba14 Mon Sep 17 00:00:00 2001 From: Arseny Sher <sher-ars@yandex.ru> Date: Wed, 23 Oct 2019 15:56:46 +0300 Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact in decoding. Manifested as ERROR: subtransaction logged without previous top-level txn record this check forbids legit behaviours like - First xl_xact_assignment record is beyond reading, i.e. earlier restart_lsn. - After restart_lsn there is some change of a subxact. - After that, there is second xl_xact_assignment (for another subxact) revealing relationship between top and first subxact. Such transaction won't be streamed anyway because we hadn't seen it in full; confirmed_flush_lsn must be past all these records. 
Saying for sure whether xact of some record encountered after snapshot was deserialized can be streamed or not requires to know whether it wrote something before deserialization point -- if yes, it hasn't been seen in full and can't be decoded. Snapshot doesn't have such info, so there is no easy way to relax the check. --- src/backend/replication/logical/reorderbuffer.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 8ce28ad629..6faba6077e 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid, txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true); subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false); - if (new_top && !new_sub) - elog(ERROR, "subtransaction logged without previous top-level txn record"); - if (!new_sub) { if (subtxn->is_known_as_subxact) -- 2.11.0 -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Sun, Feb 9, 2020 at 9:37 PM Arseny Sher <a.sher@postgrespro.ru> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > >> 1) Decoding from existing slot (*not* initial snapshot construction) > >> starts up, immediately picks up snapshot at restart_lsn (getting into > >> SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it > >> hadn't seen in full (no toplevel records) transaction which it is not > >> even going to stream -- but still dies with "subtransation logged > >> without...". That's my example above, and that's what people are > >> complaining about. Here, usage of serialized snapshot and jump to > >> SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be > >> able to stream data since confirmed_flush_lsn, we must pick it up as we > >> might not be able to assemble it from scratch in time. I mean, what is > >> wrong here is not serialized snapshot usage but the check. > >> > > > > I was thinking if we have some way to skip processing of > > xl_xact_assignment for such cases, then it might be better. Say, > > along with restart_lsn, if have some way to find corresponding nextXid > > (below which we don't need to process records). > > I don't believe you can that without persisting additional > data. Basically, what we need is list of transactions who are running at > the point of snapshot serialization *and* already wrote something before > it -- those we hadn't seen in full and can't decode. We have no such > data currently. The closest thing we have is xl_running_xacts->nextXid, > but > > 1) issued xid doesn't necessarily means xact actually wrote something, > so we can't just skip xl_xact_assignment for xid < nextXid, it might > still be decoded > 2) snapshot might be serialized not at xl_running_xacts anyway > > Surely this thing doesn't deserve changing persisted data format. > I agree that it won't be a good idea to change the persisted data format, especially in back-branches. I don't see any fix which can avoid this without doing major changes in the code. Apart from this, we have to come up with a solution for point (3) discussed in the above email [1] which again could be change in design. I think we can first try to proceed with the patch 0002-Stop-demanding-that-top-xact-must-be-seen-before--v2 and then we can discuss the other patch. I can't see a way to write a test case for this, can you think of any way? Andres, anyone else, if you have a better idea other than changing the code (removing the expected error) as in 0002-Stop-demanding-that-top-xact-must-be-seen-before--v2, then please, let us know. You can read the points (1) and (3) in the email above [1] where the below error check will hit for valid cases. We have discussed this in detail, but couldn't come up with anything better than to remove this check. @@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid, txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true); subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false); - if (new_top && !new_sub) - elog(ERROR, "subtransaction logged without previous top-level txn record"); - [1] - https://www.postgresql.org/message-id/87zhdx76d5.fsf%40ars-thinkpad -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes: >> I don't believe you can that without persisting additional >> data. Basically, what we need is list of transactions who are running at >> the point of snapshot serialization *and* already wrote something before >> it -- those we hadn't seen in full and can't decode. We have no such >> data currently. The closest thing we have is xl_running_xacts->nextXid, >> but >> >> 1) issued xid doesn't necessarily means xact actually wrote something, >> so we can't just skip xl_xact_assignment for xid < nextXid, it might >> still be decoded >> 2) snapshot might be serialized not at xl_running_xacts anyway >> >> Surely this thing doesn't deserve changing persisted data format. >> > > I agree that it won't be a good idea to change the persisted data > format, especially in back-branches. I don't see any fix which can > avoid this without doing major changes in the code. Apart from this, > we have to come up with a solution for point (3) discussed in the > above email [1] which again could be change in design. I think we can > first try to proceed with the patch > 0002-Stop-demanding-that-top-xact-must-be-seen-before--v2 and then we > can discuss the other patch. I can't see a way to write a test case > for this, can you think of any way? Yeah, let's finally get it. Attached is raw version of isolation test triggering false 'subtransaction logged without...' (case (1)). However, frankly I don't see much value in it, so I'm dubious whether it should be included in the patch. diff --git a/contrib/test_decoding/specs/subxact_logged_without_top.spec b/contrib/test_decoding/specs/subxact_logged_without_top.spec new file mode 100644 index 0000000000..55b51357a9 --- /dev/null +++ b/contrib/test_decoding/specs/subxact_logged_without_top.spec @@ -0,0 +1,51 @@ + +setup +{ + SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write inxact + CREATE TABLE harvest(apples integer); + CREATE OR REPLACE FUNCTION subxacts() returns void as $$ + BEGIN + FOR i in 1 .. 65 LOOP + BEGIN + INSERT INTO harvest VALUES (42); + EXCEPTION + WHEN OTHERS THEN + RAISE; + END; + END LOOP; + END; $$LANGUAGE 'plpgsql'; +} + +teardown +{ + DROP TABLE IF EXISTS harvest; + SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot'); +} + +session "s0" +setup { SET synchronous_commit=on; } +step "s0_begin" { BEGIN; } +step "s0_first_subxact" { + DO LANGUAGE plpgsql $$ + BEGIN + BEGIN + INSERT INTO harvest VALUES (41); + EXCEPTION WHEN OTHERS THEN RAISE; + END; + END $$; +} +step "s0_many_subxacts" { select subxacts(); } +step "s0_commit" { COMMIT; } + +session "s1" +setup { SET synchronous_commit=on; } +step "s1_begin" { BEGIN; } +step "s1_dml" { INSERT INTO harvest VALUES (43); } +step "s1_commit" { COMMIT; } + +session "s2" +setup { SET synchronous_commit=on; } +step "s2_checkpoint" { CHECKPOINT; } +step "s2_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0','skip-empty-xacts', '1'); } + +permutation "s0_begin" "s0_first_subxact" "s2_checkpoint" "s1_begin" "s1_dml" "s0_many_subxacts" "s0_commit" "s2_checkpoint""s2_get_changes" "s1_commit" "s2_get_changes" -- cheers, Arseny
On Mon, Feb 10, 2020 at 6:34 PM Arseny Sher <a.sher@postgrespro.ru> wrote: > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > >> I don't believe you can that without persisting additional > >> data. Basically, what we need is list of transactions who are running at > >> the point of snapshot serialization *and* already wrote something before > >> it -- those we hadn't seen in full and can't decode. We have no such > >> data currently. The closest thing we have is xl_running_xacts->nextXid, > >> but > >> > >> 1) issued xid doesn't necessarily means xact actually wrote something, > >> so we can't just skip xl_xact_assignment for xid < nextXid, it might > >> still be decoded > >> 2) snapshot might be serialized not at xl_running_xacts anyway > >> > >> Surely this thing doesn't deserve changing persisted data format. > >> > > > > I agree that it won't be a good idea to change the persisted data > > format, especially in back-branches. I don't see any fix which can > > avoid this without doing major changes in the code. Apart from this, > > we have to come up with a solution for point (3) discussed in the > > above email [1] which again could be change in design. I think we can > > first try to proceed with the patch > > 0002-Stop-demanding-that-top-xact-must-be-seen-before--v2 and then we > > can discuss the other patch. I can't see a way to write a test case > > for this, can you think of any way? > > Yeah, let's finally get it. > > Attached is raw version of isolation test triggering false > 'subtransaction logged without...' (case (1)). > This didn't reproduce the desired error for me (tried without a patch). I think you need to add two more steps ("s2_checkpoint" "s2_get_changes") at the end of the test to set the restart_lsn at the appropriate location. > However, frankly I don't > see much value in it, so I'm dubious whether it should be included in > the patch. > I think this will surely test some part of the system which was not tested before, mainly having some subxacts without top-xact getting decoded even though we don't need to send such a transaction. Can you prepare a complete patch (for Stop-demanding-that-top-xact-must-be-seen-before-sub) having this test as part of it? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:

>> Attached is raw version of isolation test triggering false
>> 'subtransaction logged without...' (case (1)).
>>
>
> This didn't reproduce the desired error for me (tried without a
> patch). I think you need to add two more steps ("s2_checkpoint"
> "s2_get_changes") at the end of the test to set the restart_lsn at the
> appropriate location.

That's weird; it reliably fails with the expected error for me. There are
already two s2_checkpoint's: the first establishes a potential (broken)
restart_lsn (it serializes a snapshot after the first xl_xact_assignment of
the s0 xact, but before the first record of the s1 xact); the second ensures
that the s2_get_changes directly following it will actually advance the slot,
making that potential restart_lsn real.

I don't see how adding s2_checkpoint and s2_get_changes helps here. Do they
really provoke the error in your setup? Could you check with pg_waldump
what's going on?

>
>> However, frankly I don't
>> see much value in it, so I'm dubious whether it should be included in
>> the patch.
>>
>
> I think this will surely test some part of the system which was not
> tested before, mainly having some subxacts without top-xact getting
> decoded even though we don't need to send such a transaction. Can you
> prepare a complete patch (for
> Stop-demanding-that-top-xact-must-be-seen-before-sub) having this test
> as part of it?

Ok, will do.

--
cheers, arseny
>> I think this will surely test some part of the system which was not >> tested before, mainly having some subxacts without top-xact getting >> decoded even though we don't need to send such a transaction. Can you >> prepare a complete patch (for >> Stop-demanding-that-top-xact-must-be-seen-before-sub) having this test >> as part of it? > > Ok, will do. Here it is. From 3748a87618dfb7379565884655f4c91b22724595 Mon Sep 17 00:00:00 2001 From: Arseny Sher <sher-ars@yandex.ru> Date: Wed, 23 Oct 2019 15:56:46 +0300 Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact in decoding. Manifested as ERROR: subtransaction logged without previous top-level txn record this check forbids legit behaviours like - First xl_xact_assignment record is beyond reading, i.e. earlier restart_lsn. - After restart_lsn there is some change of a subxact. - After that, there is second xl_xact_assignment (for another subxact) revealing relationship between top and first subxact. Such transaction won't be streamed anyway because we hadn't seen it in full; confirmed_flush_lsn must be past all these records. Saying for sure whether xact of some record encountered after snapshot was deserialized can be streamed or not requires to know whether it wrote something before deserialization point -- if yes, it hasn't been seen in full and can't be decoded. Snapshot doesn't have such info, so there is no easy way to relax the check. --- contrib/test_decoding/Makefile | 2 +- .../test_decoding/expected/subxact_without_top.out | 40 +++++++++++++ .../test_decoding/specs/subxact_without_top.spec | 65 ++++++++++++++++++++++ src/backend/replication/logical/reorderbuffer.c | 3 - 4 files changed, 106 insertions(+), 4 deletions(-) create mode 100644 contrib/test_decoding/expected/subxact_without_top.out create mode 100644 contrib/test_decoding/specs/subxact_without_top.spec diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile index 4afb1d963e..f439c582a5 100644 --- a/contrib/test_decoding/Makefile +++ b/contrib/test_decoding/Makefile @@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \ decoding_into_rel binary prepared replorigin time messages \ spill slot truncate ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \ - oldest_xmin snapshot_transfer + oldest_xmin snapshot_transfer subxact_without_top REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf diff --git a/contrib/test_decoding/expected/subxact_without_top.out b/contrib/test_decoding/expected/subxact_without_top.out new file mode 100644 index 0000000000..357e0fd964 --- /dev/null +++ b/contrib/test_decoding/expected/subxact_without_top.out @@ -0,0 +1,40 @@ +Parsed test spec with 3 sessions + +starting permutation: s0_begin s0_first_subxact s2_checkpoint s1_begin s1_dml s0_many_subxacts s0_commit s2_checkpoint s2_get_changes_suppress_outputs1_commit s2_get_changes +step s0_begin: BEGIN; +step s0_first_subxact: + DO LANGUAGE plpgsql $$ + BEGIN + BEGIN + INSERT INTO harvest VALUES (41); + EXCEPTION WHEN OTHERS THEN RAISE; + END; + END $$; + +step s2_checkpoint: CHECKPOINT; +step s1_begin: BEGIN; +step s1_dml: INSERT INTO harvest VALUES (43); +step s0_many_subxacts: select subxacts(); +subxacts + + +step s0_commit: COMMIT; +step s2_checkpoint: CHECKPOINT; +step s2_get_changes_suppress_output: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids','0', 
'skip-empty-xacts', '1') LIMIT 5; +data + +BEGIN +table public.harvest: INSERT: apples[integer]:41 +table public.harvest: INSERT: apples[integer]:42 +table public.harvest: INSERT: apples[integer]:42 +table public.harvest: INSERT: apples[integer]:42 +step s1_commit: COMMIT; +step s2_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts','1'); +data + +BEGIN +table public.harvest: INSERT: apples[integer]:43 +COMMIT +?column? + +stop diff --git a/contrib/test_decoding/specs/subxact_without_top.spec b/contrib/test_decoding/specs/subxact_without_top.spec new file mode 100644 index 0000000000..1e78c2e8bf --- /dev/null +++ b/contrib/test_decoding/specs/subxact_without_top.spec @@ -0,0 +1,65 @@ +# Forces decoding session to see (and associate with toplevel before commit) +# subxact with previous top records (first xl_xact_assignment) beyond +# restart_lsn. Such partially seen xact won't be streamed as it must finish +# before confirmed_flush_lsn, but nevertheless this harmless schedule used to +# cause +# ERROR: subtransaction logged without previous top-level txn record +# failures. + +setup +{ + SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write inxact + CREATE TABLE harvest(apples integer); + CREATE OR REPLACE FUNCTION subxacts() returns void as $$ + BEGIN + FOR i in 1 .. 128 LOOP + BEGIN + INSERT INTO harvest VALUES (42); + EXCEPTION + WHEN OTHERS THEN + RAISE; + END; + END LOOP; + END; $$LANGUAGE 'plpgsql'; +} + +teardown +{ + DROP TABLE IF EXISTS harvest; + SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot'); +} + +session "s0" +setup { SET synchronous_commit=on; } +step "s0_begin" { BEGIN; } +step "s0_first_subxact" { + DO LANGUAGE plpgsql $$ + BEGIN + BEGIN + INSERT INTO harvest VALUES (41); + EXCEPTION WHEN OTHERS THEN RAISE; + END; + END $$; +} +step "s0_many_subxacts" { select subxacts(); } +step "s0_commit" { COMMIT; } + +session "s1" +setup { SET synchronous_commit=on; } +step "s1_begin" { BEGIN; } +step "s1_dml" { INSERT INTO harvest VALUES (43); } +step "s1_commit" { COMMIT; } + +session "s2" +setup { SET synchronous_commit=on; } +step "s2_checkpoint" { CHECKPOINT; } +step "s2_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0','skip-empty-xacts', '1'); } +step "s2_get_changes_suppress_output" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids','0', 'skip-empty-xacts', '1') LIMIT 5; } + +# First s2_checkpoint serializes snapshot (creates potential restart_lsn point) +# *after* initial subxact, i.e. first xl_xact_assignment is beyond it. +# s0_many_subxacts forces second xl_xact_assignment (issues > +# PGPROC_MAX_CACHED_SUBXIDS) to associate subxact with toplevel before commit. +# Second s2_checkpoint ensures s2_get_changes directly following it will +# advance the slot, establishing that restart_lsn. 
+permutation "s0_begin" "s0_first_subxact" "s2_checkpoint" "s1_begin" "s1_dml" "s0_many_subxacts" "s0_commit" "s2_checkpoint""s2_get_changes_suppress_output" "s1_commit" "s2_get_changes" diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index aeebbf243a..481277a1fd 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -838,9 +838,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid, txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true); subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false); - if (new_top && !new_sub) - elog(ERROR, "subtransaction logged without previous top-level txn record"); - if (!new_sub) { if (rbtxn_is_known_subxact(subtxn)) -- 2.11.0 -- cheers, arseny
On Tue, Feb 11, 2020 at 10:02 AM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> >> Attached is raw version of isolation test triggering false
> >> 'subtransaction logged without...' (case (1)).
> >>
> >
> > This didn't reproduce the desired error for me (tried without a
> > patch). I think you need to add two more steps ("s2_checkpoint"
> > "s2_get_changes") at the end of the test to set the restart_lsn at the
> > appropriate location.
>
> That's weird, it reliably fails with expected error for me. There are
> already two s2_checkpoint's: first establishes potential (broken)
> restart_lsn (serializes snapshot after first xl_xact_assignment of s0
> xact, but before first record of s1 xact), the second ensures
> s2_get_changes directly following it will actually advance the slot,
>

In my case, s2_get_changes doesn't seem to be advancing the restart lsn because, when it processed the running_xact logged by s2_checkpoint, the slot's confirmed flush location (slot->data.confirmed_flush) was behind it. As confirmed_flush was behind the running_xact of s2_checkpoint, it couldn't update slot->candidate_restart_lsn (in function LogicalIncreaseRestartDecodingForSlot). I think the confirmed_flush location will only be updated at the end of get_changes. This is the reason I need an extra get_changes call to generate the error.

I will think about and investigate this more, but thought of sharing the current situation with you. There is something different going on in my system, or maybe the nature of the test is like that.

> making that potential restart_lsn real.
>
> I don't see how adding s2_checkpoint and s2_get_changes helps here. Do
> they really provoke error in your setup?
>

Yes, I am running each of the steps in the test manually, using three different terminals.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
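The guard Amit describes lives in LogicalIncreaseRestartDecodingForSlot() in logical.c: a candidate restart_lsn takes effect immediately only if the client has already confirmed a flush position past the xl_running_xacts record that produced it; otherwise it is merely remembered. A trimmed sketch of that logic, paraphrased from memory (the real function has more branches and bookkeeping):

/* Sketch of LogicalIncreaseRestartDecodingForSlot(current_lsn, restart_lsn);
 * see logical.c for the real thing. */
SpinLockAcquire(&slot->mutex);

if (restart_lsn <= slot->data.restart_lsn)
{
    /* never move the slot's restart_lsn backwards */
}
else if (current_lsn <= slot->data.confirmed_flush)
{
    /* the client already confirmed past current_lsn: accept right away */
    slot->candidate_restart_valid = current_lsn;
    slot->candidate_restart_lsn = restart_lsn;
}
else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
{
    /* otherwise only remember a candidate; it becomes the real restart_lsn
     * once the client confirms a position past candidate_restart_valid */
    slot->candidate_restart_valid = current_lsn;
    slot->candidate_restart_lsn = restart_lsn;
}

SpinLockRelease(&slot->mutex);

Since confirmed_flush itself is only advanced at the end of a pg_logical_slot_get_changes() call, an xl_running_xacts processed while confirmed_flush still lags cannot move restart_lsn immediately, which is why the extra get_changes call matters here.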
Amit Kapila <amit.kapila16@gmail.com> writes: >> That's weird, it reliably fails with expected error for me. There are >> already two s2_checkpoint's: first establishes potential (broken) >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0 >> xact, but before first record of s1 xact), the second ensures >> s2_get_changes directly following it will actually advance the slot, >> > > In my case, s2_get_changes doesn't seem to be advancing the restart > lsn because when it processed running_xact by s2_checkpoint, the slots > confirm flush location (slot->data.confirmed_flush) was behind it. As > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't > update slot->candidate_restart_lsn (in function > LogicalIncreaseRestartDecodingForSlot). I think the confirmed_flush > location will only be updated at the end of get_changes. This is the > reason I need extra get_changes call to generate an error. > > I will think and investigate this more, but thought of sharing the > current situation with you. There is something different going on in > my system or maybe the nature of test is like that. Ah, I think I know what's happening -- you have one more xl_running_xacts which catches the advancement -- similar issue is explained in the comment in oldest_xmin.spec. Try attached. From bd9e50d919cc07ec30f06c443d142f5f8510887c Mon Sep 17 00:00:00 2001 From: Arseny Sher <sher-ars@yandex.ru> Date: Wed, 23 Oct 2019 15:56:46 +0300 Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact in decoding. Manifested as ERROR: subtransaction logged without previous top-level txn record this check forbids legit behaviours like - First xl_xact_assignment record is beyond reading, i.e. earlier restart_lsn. - After restart_lsn there is some change of a subxact. - After that, there is second xl_xact_assignment (for another subxact) revealing relationship between top and first subxact. Such transaction won't be streamed anyway because we hadn't seen it in full; confirmed_flush_lsn must be past all these records. Saying for sure whether xact of some record encountered after snapshot was deserialized can be streamed or not requires to know whether it wrote something before deserialization point -- if yes, it hasn't been seen in full and can't be decoded. Snapshot doesn't have such info, so there is no easy way to relax the check. 
--- contrib/test_decoding/Makefile | 2 +- .../test_decoding/expected/subxact_without_top.out | 39 +++++++++++++ .../test_decoding/specs/subxact_without_top.spec | 67 ++++++++++++++++++++++ src/backend/replication/logical/reorderbuffer.c | 3 - 4 files changed, 107 insertions(+), 4 deletions(-) create mode 100644 contrib/test_decoding/expected/subxact_without_top.out create mode 100644 contrib/test_decoding/specs/subxact_without_top.spec diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile index 4afb1d963e..f439c582a5 100644 --- a/contrib/test_decoding/Makefile +++ b/contrib/test_decoding/Makefile @@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \ decoding_into_rel binary prepared replorigin time messages \ spill slot truncate ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \ - oldest_xmin snapshot_transfer + oldest_xmin snapshot_transfer subxact_without_top REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf diff --git a/contrib/test_decoding/expected/subxact_without_top.out b/contrib/test_decoding/expected/subxact_without_top.out new file mode 100644 index 0000000000..99ce998822 --- /dev/null +++ b/contrib/test_decoding/expected/subxact_without_top.out @@ -0,0 +1,39 @@ +Parsed test spec with 3 sessions + +starting permutation: s0_begin s0_first_subxact s2_checkpoint s1_begin s1_dml s0_many_subxacts s0_commit s2_checkpoint s2_get_changes_suppress_outputs2_get_changes_suppress_output s1_commit s2_get_changes +step s0_begin: BEGIN; +step s0_first_subxact: + DO LANGUAGE plpgsql $$ + BEGIN + BEGIN + INSERT INTO harvest VALUES (41); + EXCEPTION WHEN OTHERS THEN RAISE; + END; + END $$; + +step s2_checkpoint: CHECKPOINT; +step s1_begin: BEGIN; +step s1_dml: INSERT INTO harvest VALUES (43); +step s0_many_subxacts: select subxacts(); +subxacts + + +step s0_commit: COMMIT; +step s2_checkpoint: CHECKPOINT; +step s2_get_changes_suppress_output: SELECT null n FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids','0', 'skip-empty-xacts', '1') GROUP BY n; +n + + +step s2_get_changes_suppress_output: SELECT null n FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids','0', 'skip-empty-xacts', '1') GROUP BY n; +n + +step s1_commit: COMMIT; +step s2_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts','1'); +data + +BEGIN +table public.harvest: INSERT: apples[integer]:43 +COMMIT +?column? + +stop diff --git a/contrib/test_decoding/specs/subxact_without_top.spec b/contrib/test_decoding/specs/subxact_without_top.spec new file mode 100644 index 0000000000..6e9984da62 --- /dev/null +++ b/contrib/test_decoding/specs/subxact_without_top.spec @@ -0,0 +1,67 @@ +# Forces decoding session to see (and associate with toplevel before commit) +# subxact with previous top records (first xl_xact_assignment) beyond +# restart_lsn. Such partially seen xact won't be streamed as it must finish +# before confirmed_flush_lsn, but nevertheless this harmless schedule used to +# cause +# ERROR: subtransaction logged without previous top-level txn record +# failures. + +setup +{ + SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write inxact + CREATE TABLE harvest(apples integer); + CREATE OR REPLACE FUNCTION subxacts() returns void as $$ + BEGIN + FOR i in 1 .. 
128 LOOP + BEGIN + INSERT INTO harvest VALUES (42); + EXCEPTION + WHEN OTHERS THEN + RAISE; + END; + END LOOP; + END; $$LANGUAGE 'plpgsql'; +} + +teardown +{ + DROP TABLE IF EXISTS harvest; + SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot'); +} + +session "s0" +setup { SET synchronous_commit=on; } +step "s0_begin" { BEGIN; } +step "s0_first_subxact" { + DO LANGUAGE plpgsql $$ + BEGIN + BEGIN + INSERT INTO harvest VALUES (41); + EXCEPTION WHEN OTHERS THEN RAISE; + END; + END $$; +} +step "s0_many_subxacts" { select subxacts(); } +step "s0_commit" { COMMIT; } + +session "s1" +setup { SET synchronous_commit=on; } +step "s1_begin" { BEGIN; } +step "s1_dml" { INSERT INTO harvest VALUES (43); } +step "s1_commit" { COMMIT; } + +session "s2" +setup { SET synchronous_commit=on; } +step "s2_checkpoint" { CHECKPOINT; } +step "s2_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0','skip-empty-xacts', '1'); } +step "s2_get_changes_suppress_output" { SELECT null n FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids','0', 'skip-empty-xacts', '1') GROUP BY n; } + +# First s2_checkpoint serializes snapshot (creates potential restart_lsn point) +# *after* initial subxact, i.e. first xl_xact_assignment is beyond it. +# s0_many_subxacts forces second xl_xact_assignment (issues > +# PGPROC_MAX_CACHED_SUBXIDS) to associate subxact with toplevel before commit. +# Second s2_checkpoint ensures s2_get_changes directly following it will +# advance the slot, establishing that restart_lsn. +# We do get_changes twice because if one more xl_running_xacts record had slipped +# before our CHECKPOINT, confirmed_flush will be advanced only to that record. +permutation "s0_begin" "s0_first_subxact" "s2_checkpoint" "s1_begin" "s1_dml" "s0_many_subxacts" "s0_commit" "s2_checkpoint""s2_get_changes_suppress_output" "s2_get_changes_suppress_output" "s1_commit" "s2_get_changes" diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index aeebbf243a..481277a1fd 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -838,9 +838,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid, txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true); subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false); - if (new_top && !new_sub) - elog(ERROR, "subtransaction logged without previous top-level txn record"); - if (!new_sub) { if (rbtxn_is_known_subxact(subtxn)) -- 2.11.0 -- cheers, arseny
On Tue, Feb 11, 2020 at 5:36 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> >> That's weird, it reliably fails with expected error for me. There are
> >> already two s2_checkpoint's: first establishes potential (broken)
> >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0
> >> xact, but before first record of s1 xact), the second ensures
> >> s2_get_changes directly following it will actually advance the slot,
> >>
> >
> > In my case, s2_get_changes doesn't seem to be advancing the restart
> > lsn because when it processed running_xact by s2_checkpoint, the slots
> > confirm flush location (slot->data.confirmed_flush) was behind it. As
> > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't
> > update slot->candidate_restart_lsn (in function
> > LogicalIncreaseRestartDecodingForSlot). I think the confirmed_flush
> > location will only be updated at the end of get_changes. This is the
> > reason I need extra get_changes call to generate an error.
> >
> > I will think and investigate this more, but thought of sharing the
> > current situation with you. There is something different going on in
> > my system or maybe the nature of test is like that.
>
> Ah, I think I know what's happening -- you have one more
> xl_running_xacts which catches the advancement -- similar issue is
> explained in the comment in oldest_xmin.spec.
>

There is one more inconsistency in the test case which I faced while trying to reproduce. The problem is that, after the "s0_begin" and "s0_first_subxact" steps, the open transaction is the top-transaction because we have generated the sub-transaction and closed it. Now, during the "s0_many_subxacts" step, while scanning the system table (e.g. for finding the function), the top-transaction might log WAL for the hint bits. Then we would lose the purpose of the test, because we would get WAL for the top-transaction after the restart point and there would be no error.

To fix this, I have modified "s0_first_subxact" as shown below:

+step "s0_first_subxact" {
+	DO LANGUAGE plpgsql $$
+	BEGIN
+	BEGIN
+	INSERT INTO harvest VALUES (41);
+	EXCEPTION WHEN OTHERS THEN RAISE;
+	END;
+	END $$;
+}

savepoint s1; -- added extra
INSERT INTO harvest VALUES (41); --added extra

Basically, after these two steps the open transaction will be the sub-transaction, not the top-transaction, and that will make sure that even if future steps log some WAL, it will be under the sub-transaction, not the top-transaction.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 12, 2020 at 8:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Feb 11, 2020 at 5:36 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> >
> >
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> >
> > >> That's weird, it reliably fails with expected error for me. There are
> > >> already two s2_checkpoint's: first establishes potential (broken)
> > >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0
> > >> xact, but before first record of s1 xact), the second ensures
> > >> s2_get_changes directly following it will actually advance the slot,
> > >>
> > >
> > > In my case, s2_get_changes doesn't seem to be advancing the restart
> > > lsn because when it processed running_xact by s2_checkpoint, the slots
> > > confirm flush location (slot->data.confirmed_flush) was behind it. As
> > > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't
> > > update slot->candidate_restart_lsn (in function
> > > LogicalIncreaseRestartDecodingForSlot). I think the confirmed_flush
> > > location will only be updated at the end of get_changes. This is the
> > > reason I need extra get_changes call to generate an error.
> > >
> > > I will think and investigate this more, but thought of sharing the
> > > current situation with you. There is something different going on in
> > > my system or maybe the nature of test is like that.
> >
> > Ah, I think I know what's happening -- you have one more
> > xl_running_xacts which catches the advancement -- similar issue is
> > explained in the comment in oldest_xmin.spec.
> >

Right, that is why in my case get_changes was required twice. Calling get_changes once more, as we do in oldest_xmin.spec, will make the test case reliable.

> There is one more inconsistency in the test case which I faced while
> trying to reproduce. The problem is that, after "s0_begin"
> "s0_first_subxact", steps the open transaction is the top-transaction
> because we have generated the sub-transaction and closed it. Now,
> during the "s0_many_subxacts" step, while scanning the system table
> (e.g. for finding the function) the top-transaction might log the WAL
> for the hint bits.
>

I am curious to know how this is happening in your case, because we log WAL for hint bits only when checksums or wal_log_hints are enabled (see XLogHintBitIsNeeded), which is not the default?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
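For reference, the macro Amit points to is defined in src/include/access/xlog.h and is what MarkBufferDirtyHint() consults before WAL-logging a hint-bit-only page change:

/* src/include/access/xlog.h */
#define XLogHintBitIsNeeded() (DataChecksumsEnabled() || wal_log_hints)

So hint-bit updates produce WAL only with data checksums or wal_log_hints enabled, which explains why Dilip's setup (wal_log_hints on, as he confirms below) behaves differently from a default configuration.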
On Wed, Feb 12, 2020 at 9:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Feb 12, 2020 at 8:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Feb 11, 2020 at 5:36 PM Arseny Sher <a.sher@postgrespro.ru> wrote: > > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > > > > >> That's weird, it reliably fails with expected error for me. There are > > > >> already two s2_checkpoint's: first establishes potential (broken) > > > >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0 > > > >> xact, but before first record of s1 xact), the second ensures > > > >> s2_get_changes directly following it will actually advance the slot, > > > >> > > > > > > > > In my case, s2_get_changes doesn't seem to be advancing the restart > > > > lsn because when it processed running_xact by s2_checkpoint, the slots > > > > confirm flush location (slot->data.confirmed_flush) was behind it. As > > > > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't > > > > update slot->candidate_restart_lsn (in function > > > > LogicalIncreaseRestartDecodingForSlot). I think the confirmed_flush > > > > location will only be updated at the end of get_changes. This is the > > > > reason I need extra get_changes call to generate an error. > > > > > > > > I will think and investigate this more, but thought of sharing the > > > > current situation with you. There is something different going on in > > > > my system or maybe the nature of test is like that. > > > > > > Ah, I think I know what's happening -- you have one more > > > xl_running_xacts which catches the advancement -- similar issue is > > > explained in the comment in oldest_xmin.spec. > > > > > Right, that is why in my case get_changes were required twice. After > calling get_changes as we do in oldest_xmin.spec will make test case > reliable. > > > There is one more inconsistency in the test case which I faced while > > trying to reproduce. The problem is that, after "s0_begin" > > "s0_first_subxact", steps the open transaction is the top-transaction > > because we have generated the sub-transaction and closed it. Now, > > during the "s0_many_subxacts" step, while scanning the system table > > (e.g. for finding the function) the top-transaction might log the WAL > > for the hint bits. > > > > I am curious to know how this is happening in your case? Because we > log WAL for hint-bits only when checksums or wal_log_hints are enabled > (See (or XLogHintBitIsNeeded) which is not the default case? Yeah, you are right. Actually, wal_log_hints is set in my configuration. So it should not be a problem. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 12, 2020 at 9:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Feb 12, 2020 at 8:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Feb 11, 2020 at 5:36 PM Arseny Sher <a.sher@postgrespro.ru> wrote: > > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > > > > >> That's weird, it reliably fails with expected error for me. There are > > > >> already two s2_checkpoint's: first establishes potential (broken) > > > >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0 > > > >> xact, but before first record of s1 xact), the second ensures > > > >> s2_get_changes directly following it will actually advance the slot, > > > >> > > > > > > > > In my case, s2_get_changes doesn't seem to be advancing the restart > > > > lsn because when it processed running_xact by s2_checkpoint, the slots > > > > confirm flush location (slot->data.confirmed_flush) was behind it. As > > > > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't > > > > update slot->candidate_restart_lsn (in function > > > > LogicalIncreaseRestartDecodingForSlot). I think the confirmed_flush > > > > location will only be updated at the end of get_changes. This is the > > > > reason I need extra get_changes call to generate an error. > > > > > > > > I will think and investigate this more, but thought of sharing the > > > > current situation with you. There is something different going on in > > > > my system or maybe the nature of test is like that. > > > > > > Ah, I think I know what's happening -- you have one more > > > xl_running_xacts which catches the advancement -- similar issue is > > > explained in the comment in oldest_xmin.spec. > > > > > Right, that is why in my case get_changes were required twice. After > calling get_changes as we do in oldest_xmin.spec will make test case > reliable. > Attached is a patch where I have modified the comments and slightly edited the commit message. This patch was not getting applied in v11 and branches lower than that, so I prepared a patch for those branches as well. I have tested this patch till 9.5 and it works as intended. Can you also once check the patch and verify it in back-branches? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 12, 2020 at 1:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Feb 12, 2020 at 9:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Feb 12, 2020 at 8:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Tue, Feb 11, 2020 at 5:36 PM Arseny Sher <a.sher@postgrespro.ru> wrote: > > > > > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > > > > > > >> That's weird, it reliably fails with expected error for me. There are > > > > >> already two s2_checkpoint's: first establishes potential (broken) > > > > >> restart_lsn (serializes snapshot after first xl_xact_assignment of s0 > > > > >> xact, but before first record of s1 xact), the second ensures > > > > >> s2_get_changes directly following it will actually advance the slot, > > > > >> > > > > > > > > > > In my case, s2_get_changes doesn't seem to be advancing the restart > > > > > lsn because when it processed running_xact by s2_checkpoint, the slots > > > > > confirm flush location (slot->data.confirmed_flush) was behind it. As > > > > > confirmed_flush was behind running_xact of s2_checkpoint, it couldn't > > > > > update slot->candidate_restart_lsn (in function > > > > > LogicalIncreaseRestartDecodingForSlot). I think the confirmed_flush > > > > > location will only be updated at the end of get_changes. This is the > > > > > reason I need extra get_changes call to generate an error. > > > > > > > > > > I will think and investigate this more, but thought of sharing the > > > > > current situation with you. There is something different going on in > > > > > my system or maybe the nature of test is like that. > > > > > > > > Ah, I think I know what's happening -- you have one more > > > > xl_running_xacts which catches the advancement -- similar issue is > > > > explained in the comment in oldest_xmin.spec. > > > > > > > > Right, that is why in my case get_changes were required twice. After > > calling get_changes as we do in oldest_xmin.spec will make test case > > reliable. > > > > Attached is a patch where I have modified the comments and slightly > edited the commit message. This patch was not getting applied in v11 > and branches lower than that, so I prepared a patch for those branches > as well. I have tested this patch till 9.5 and it works as intended. > > Can you also once check the patch and verify it in back-branches? I have checked the patch and it looks fine to me. I have also tested it on the back branches and it works fine. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 13, 2020 at 10:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Feb 12, 2020 at 1:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Attached is a patch where I have modified the comments and slightly edited the commit message. This patch did not apply to v11 and earlier branches, so I prepared a patch for those branches as well. I have tested this patch down to 9.5 and it works as intended.
> >
> > Can you also check the patch and verify it in the back-branches?
>
> I have checked the patch and it looks fine to me. I have also tested it on the back branches and it works fine.
>

Thanks. I am planning to commit this sometime next week (on Wednesday, 19-Feb) to give others time to review or share their opinions.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:

> Attached is a patch where I have modified the comments and slightly edited the commit message. This patch did not apply to v11 and earlier branches, so I prepared a patch for those branches as well. I have tested this patch down to 9.5 and it works as intended.
>
> Can you also check the patch and verify it in the back-branches?

Sorry for the delay. I'm fine with the changes. make check down to 9.5 is also happy here.

The point about wal_log_hints is interesting; it hadn't occurred to me.

--
cheers, arseny
On Fri, Feb 14, 2020 at 7:04 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> > Attached is a patch where I have modified the comments and slightly edited the commit message. This patch did not apply to v11 and earlier branches, so I prepared a patch for those branches as well. I have tested this patch down to 9.5 and it works as intended.
> >
> > Can you also check the patch and verify it in the back-branches?
>
> Sorry for the delay. I'm fine with the changes. make check down to 9.5 is also happy here.
>

Pushed; let's keep an eye on the buildfarm.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 9, 2020 at 9:37 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Somehow I hadn't realized this earlier, so my comments/commit messages in the patches above were not accurate here; I've edited them. Also, in the first patch serialized snapshots are no longer used for new slot creation at all, as Andres suggested above.
>

+ /*
+ * Don't use serialized snapshot if we are not sure where all
+ * currently running xacts will finish (new slot creation).
+ * (Actually, if we came here through xl_running_xacts, we could perform
+ * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
+ * but added lines of code would hardly worth the benefit.)
+ */
+ if (builder->start_decoding_at == InvalidXLogRecPtr)
+     return false;

Instead of using start_decoding_at to decide whether to restore the snapshot or not, won't it be better to have a new variable in SnapBuild (say, can_use_serialized_snap or something like that) for this purpose? I think the patch is trying to use a variable that is not meant for the purpose we are using it for, so I am not sure this is the right direction for the fix.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:

> On Sun, Feb 9, 2020 at 9:37 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> + /*
> + * Don't use serialized snapshot if we are not sure where all
> + * currently running xacts will finish (new slot creation).
> + * (Actually, if we came here through xl_running_xacts, we could perform
> + * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
> + * but added lines of code would hardly worth the benefit.)
> + */
> + if (builder->start_decoding_at == InvalidXLogRecPtr)
> +     return false;
>
> Instead of using start_decoding_at to decide whether to restore the snapshot or not, won't it be better to have a new variable in SnapBuild (say, can_use_serialized_snap or something like that) for this purpose?

start_decoding_at, which is initialized externally at AllocateSnapshotBuilder, is what actually defines how to handle serialized snapshots: if it is a valid LSN, snapbuild must trust the caller that WAL reading starts early enough to stream from this LSN, so we deserialize the snapshot and jump to CONSISTENT. If it is invalid, we don't know the safe streaming point yet, and it remains invalid until we learn a full snapshot and then wait for all running xacts to finish. So such a bool would be a pointless synonym.

Moreover, as the cited comment mentions:

> + * (Actually, if we came here through xl_running_xacts, we could perform
> + * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
> + * but added lines of code would hardly worth the benefit.)

there is nothing wrong with using the serialized snapshot per se. It's just that we must wait for all running xacts to finish after getting the snapshot, and this is impossible if we don't know who is running. So can_use_serialized_snap would even be somewhat confusing.

--
cheers, arseny
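[Illustrative aside: a hedged sketch of the two situations contrasted above. Only start_decoding_at, InvalidXLogRecPtr and the restore-or-not check come from the quoted hunk; the struct and function names are invented, and the real snapbuild.c logic is more involved.]

/* Invented names; only the start_decoding_at check mirrors the quoted hunk. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct SnapBuildSketch
{
    /*
     * Valid when decoding from an existing slot: the caller guarantees that
     * WAL reading starts early enough to stream from this point.  Left
     * invalid during new slot creation, because the safe streaming point is
     * not known yet and must be found by the snapshot builder itself.
     */
    XLogRecPtr start_decoding_at;
} SnapBuildSketch;

/*
 * Counterpart of the check in the quoted hunk: restoring a serialized
 * snapshot (and so jumping straight to CONSISTENT) is only safe when a
 * streaming start point has already been established by the caller.
 */
static bool
sketch_may_restore_serialized_snapshot(const SnapBuildSketch *builder)
{
    if (builder->start_decoding_at == InvalidXLogRecPtr)
        return false;   /* new slot creation: wait out running xacts instead */

    return true;        /* existing slot: trust the caller's start point */
}

int
main(void)
{
    SnapBuildSketch creating = {InvalidXLogRecPtr};
    SnapBuildSketch existing = {1000};

    printf("new slot:      restore serialized snapshot? %d\n",
           sketch_may_restore_serialized_snapshot(&creating));
    printf("existing slot: restore serialized snapshot? %d\n",
           sketch_may_restore_serialized_snapshot(&existing));
    return 0;
}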
On Mon, Mar 2, 2020 at 1:11 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> > On Sun, Feb 9, 2020 at 9:37 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> > + /*
> > + * Don't use serialized snapshot if we are not sure where all
> > + * currently running xacts will finish (new slot creation).
> > + * (Actually, if we came here through xl_running_xacts, we could perform
> > + * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
> > + * but added lines of code would hardly worth the benefit.)
> > + */
> > + if (builder->start_decoding_at == InvalidXLogRecPtr)
> > +     return false;
> >
> > Instead of using start_decoding_at to decide whether to restore the snapshot or not, won't it be better to have a new variable in SnapBuild (say, can_use_serialized_snap or something like that) for this purpose?
>
> start_decoding_at, which is initialized externally at AllocateSnapshotBuilder, is what actually defines how to handle serialized snapshots: if it is a valid LSN, snapbuild must trust the caller that WAL reading starts early enough to stream from this LSN, so we deserialize the snapshot and jump to CONSISTENT. If it is invalid, we don't know the safe streaming point yet, and it remains invalid until we learn a full snapshot and then wait for all running xacts to finish.
>

I think here you are trying to deduce the meaning. I don't see that it clearly conveys "don't use serialized snapshots". It is also not clear to me why you have changed the code below; basically, why is it okay to pass InvalidXLogRecPtr instead of restart_lsn?

@@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
     ReplicationSlotMarkDirty();
     ReplicationSlotSave();

-    ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
+    ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
                                  need_full_snapshot, false,
                                  read_page, prepare_write, do_write,
                                  update_progress);

> So such a bool would be a pointless synonym.
>
> Moreover, as the cited comment mentions:
>
> > + * (Actually, if we came here through xl_running_xacts, we could perform
> > + * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
> > + * but added lines of code would hardly worth the benefit.)
>
> there is nothing wrong with using the serialized snapshot per se. It's just that we must wait for all running xacts to finish after getting the snapshot, and this is impossible if we don't know who is running.
>

I am not denying any such possibility, and even if we do something like that, it will be for the master branch.

> So can_use_serialized_snap would even be somewhat confusing.
>

It is possible that your idea of deducing this from start_decoding_at is better, but I am not sure about it; if anyone else also thinks that is a good way to do it, then fine.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:

> I think here you are trying to deduce the meaning. I don't see that it clearly conveys "don't use serialized snapshots". It is also not clear to me why you have changed the code below; basically, why is it okay to pass InvalidXLogRecPtr instead of restart_lsn?
>
> @@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
>     ReplicationSlotMarkDirty();
>     ReplicationSlotSave();
>
> -    ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
> +    ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
>                                  need_full_snapshot, false,
>                                  read_page, prepare_write, do_write,
>                                  update_progress);

Because when we create the slot we don't demand to stream from some specific point. In fact, we just can't, because we don't know from which LSN it is actually possible to stream, i.e. when we'd have a good snapshot and no old xacts (which we haven't seen in full) running. It is up to snapbuild.c to define this point. The previous coding was meaningless: we asked for some arbitrary restart_lsn and snapbuild.c would silently advance it to the earliest suitable LSN.

OTOH, when we are decoding from an existing slot, not only do we know the earliest possible point, but to avoid missing xacts we must enforce streaming from this very point, even though the snapbuilder is unable to check its safety (because it might not know which xacts were running at the point of the snapshot). start_decoding_at reflects the difference between these scenarios, and the handling of serialized snapshots stems from it.

Thanks for looking into this.

--
cheers, arseny
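[Illustrative aside: a hypothetical sketch spelling out the distinction drawn above -- restart_lsn pins the WAL we still need, start_decoding_at names where streaming of COMMITs begins. The names are invented; the assumption that an existing slot streams from its confirmed position follows the description in this message rather than any quoted code.]

/* Invented names; a sketch of the restart_lsn vs. start_decoding_at split. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct SlotSketch
{
    XLogRecPtr restart_lsn;     /* keeps the WAL we still need from recycling */
    XLogRecPtr confirmed_flush; /* how far the consumer has confirmed replay */
} SlotSketch;

/*
 * Where should streaming of COMMITs begin?  At slot creation we must already
 * pin WAL via restart_lsn but cannot yet name a streaming point, so we return
 * InvalidXLogRecPtr and let the snapshot builder discover it.  For an
 * existing slot we stream from the confirmed position to avoid missing
 * xacts, even though its safety can no longer be re-checked.
 */
static XLogRecPtr
sketch_choose_start_decoding_at(const SlotSketch *slot, bool creating_new_slot)
{
    if (creating_new_slot)
        return InvalidXLogRecPtr;

    return slot->confirmed_flush;
}

int
main(void)
{
    SlotSketch slot = {100, 400};

    printf("creation: start_decoding_at = %llu\n", (unsigned long long)
           sketch_choose_start_decoding_at(&slot, true));
    printf("existing: start_decoding_at = %llu\n", (unsigned long long)
           sketch_choose_start_decoding_at(&slot, false));
    return 0;
}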
On Mon, Mar 2, 2020 at 7:07 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> > I think here you are trying to deduce the meaning. I don't see that it clearly conveys "don't use serialized snapshots". It is also not clear to me why you have changed the code below; basically, why is it okay to pass InvalidXLogRecPtr instead of restart_lsn?
> >
> > @@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
> >     ReplicationSlotMarkDirty();
> >     ReplicationSlotSave();
> >
> > -    ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
> > +    ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
> >                                  need_full_snapshot, false,
> >                                  read_page, prepare_write, do_write,
> >                                  update_progress);
>
> Because when we create the slot we don't demand to stream from some specific point. In fact, we just can't, because we don't know from which LSN it is actually possible to stream, i.e. when we'd have a good snapshot and no old xacts (which we haven't seen in full) running. It is up to snapbuild.c to define this point. The previous coding was meaningless: we asked for some arbitrary restart_lsn and snapbuild.c would silently advance it to the earliest suitable LSN.
>

Hmm, if this is the case, then it should be true even without solving this particular problem, and we should be able to make this change.

Leaving that aside, I think this change can make the copy-replication-slot functionality also skip using serialized snapshots, which is not our intention. Also, it doesn't seem like a good idea to skip setting start_decoding_at when we have already set slot->data.restart_lsn to this value.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:

>> Because when we create the slot we don't demand to stream from some specific point. In fact, we just can't, because we don't know from which LSN it is actually possible to stream, i.e. when we'd have a good snapshot and no old xacts (which we haven't seen in full) running. It is up to snapbuild.c to define this point. The previous coding was meaningless: we asked for some arbitrary restart_lsn and snapbuild.c would silently advance it to the earliest suitable LSN.
>
> Hmm, if this is the case, then it should be true even without solving this particular problem, and we should be able to make this change.

Right.

> Leaving that aside, I think this change can make the copy-replication-slot functionality also skip using serialized snapshots, which is not our intention.

As I say at [1], the logical slot copying facility is currently broken in this regard anyway: restart_lsn is copied but confirmed_flush isn't, and the right fix, in my view, is to avoid DecodingContextFindStartpoint there altogether (by checking that the donor's confirmed_flush is valid and copying it), which would render this irrelevant.

To speculate, even if we wanted to go through DecodingContextFindStartpoint for slot copying and establish confirmed_flush on our own, surely we'd need to handle serialized snapshots exactly as new slot creation does, because the dangers of reaching SNAPBUILD_CONSISTENT too early are the same in both cases.

> Also, it doesn't seem like a good idea to skip setting start_decoding_at when we have already set slot->data.restart_lsn to this value.

Well, these two fields hold quite different values. BTW, I find the naming here somewhat unfortunate, and this phrase suggests that it indeed leads to confusion. The slot's restart_lsn is the LSN from which we start reading WAL, and by setting data.restart_lsn we prevent the WAL we need from being recycled. start_decoding_at is the LSN from which we start *streaming*, i.e. actually replaying COMMITs. So setting the first one (as we must hold the WAL) and not the second one (as we don't know the streaming point yet when we start slot creation) is just fine.

[1] https://www.postgresql.org/message-id/flat/CA%2Bfd4k70BXLTm-N6q18LrL%3DGbKtwY3-2%2B%2BUVFw05SvFTkZgTyQ%40mail.gmail.com

--
cheers, arseny
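[Illustrative aside: a purely hypothetical sketch of the alternative suggested above for slot copying -- take both positions from the donor instead of re-running DecodingContextFindStartpoint. None of this is the actual copy code, which does not appear in this thread.]

/* Entirely hypothetical names; not the actual slot copy code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct SlotPositionsSketch
{
    XLogRecPtr restart_lsn;     /* copied today */
    XLogRecPtr confirmed_flush; /* not copied today, per the message above */
} SlotPositionsSketch;

/*
 * Shape of the suggested fix: refuse to copy from a donor whose
 * confirmed_flush is not yet valid, otherwise take both positions from it,
 * so no new consistent point has to be searched for on behalf of the copy.
 */
static bool
sketch_copy_slot_positions(const SlotPositionsSketch *donor,
                           SlotPositionsSketch *copy)
{
    if (donor->confirmed_flush == InvalidXLogRecPtr)
        return false;           /* donor cannot be used as a template yet */

    copy->restart_lsn = donor->restart_lsn;
    copy->confirmed_flush = donor->confirmed_flush;
    return true;
}

int
main(void)
{
    SlotPositionsSketch donor = {100, 400};
    SlotPositionsSketch copy = {InvalidXLogRecPtr, InvalidXLogRecPtr};

    if (sketch_copy_slot_positions(&donor, &copy))
        printf("copy: restart_lsn=%llu confirmed_flush=%llu\n",
               (unsigned long long) copy.restart_lsn,
               (unsigned long long) copy.confirmed_flush);
    return 0;
}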