Re: logical decoding and replication of sequences - Mailing list pgsql-hackers

From:            Tomas Vondra
Subject:         Re: logical decoding and replication of sequences
Msg-id:          00708727-d856-1886-48e3-811296c7ba8c@enterprisedb.com
In response to:  Re: logical decoding and replication of sequences
                 (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List:            pgsql-hackers
On 4/1/22 17:02, Tomas Vondra wrote:
>
> On 3/28/22 07:29, Amit Kapila wrote:
>> ...
>>
>> While thinking about this, I think I see a problem with the
>> non-transactional handling of sequences. It seems that we will skip
>> sending a non-transactional sequence change if it occurs before
>> decoding has reached a consistent point, but the surrounding commit
>> occurs after a consistent point is reached. In such cases, the
>> corresponding DMLs like inserts will be sent, but the sequence
>> changes won't be. For example (this scenario is based on
>> twophase_snapshot.spec):
>>
>> Initial setup:
>> ==============
>> Create table t1_seq(c1 int);
>> Create Sequence seq1;
>>
>> Test execution via multiple sessions (this test allows the insert in
>> session-2 to happen before we reach a consistent point, while the
>> commit happens after a consistent point):
>> =====================================================================
>>
>> Session-2:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-1:
>> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
>> 'test_decoding', false, true);
>>
>> Session-3:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-2:
>> Commit;
>> Begin;
>> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
>>
>> Session-3:
>> Commit;
>>
>> Session-2:
>> Commit 'foo'
>>
>> Session-1:
>> SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
>> 'include-xids', 'false', 'skip-empty-xacts', '1');
>>
>> data
>> ----------------------------------------------
>> BEGIN
>> table public.t1_seq: INSERT: c1[integer]:1
>> table public.t1_seq: INSERT: c1[integer]:2
>> table public.t1_seq: INSERT: c1[integer]:3
>> table public.t1_seq: INSERT: c1[integer]:4
>> table public.t1_seq: INSERT: c1[integer]:5
>> table public.t1_seq: INSERT: c1[integer]:6
>>
>> Now, if we normally try to decode such an insert, the result would be
>> something like:
>>
>> data
>> ------------------------------------------------------------------------------
>> sequence public.seq1: transactional:0 last_value: 33 log_cnt: 0 is_called:1
>> sequence public.seq1: transactional:0 last_value: 66 log_cnt: 0 is_called:1
>> sequence public.seq1: transactional:0 last_value: 99 log_cnt: 0 is_called:1
>> sequence public.seq1: transactional:0 last_value: 132 log_cnt: 0 is_called:1
>> BEGIN
>> table public.t1_seq: INSERT: c1[integer]:1
>> table public.t1_seq: INSERT: c1[integer]:2
>> table public.t1_seq: INSERT: c1[integer]:3
>> table public.t1_seq: INSERT: c1[integer]:4
>> table public.t1_seq: INSERT: c1[integer]:5
>> table public.t1_seq: INSERT: c1[integer]:6
>>
>> This will create an inconsistent replica, as the sequence changes
>> won't be replicated.
>
> Hmm, that's interesting. I wonder if it can actually happen, though.
> Have you been able to reproduce that, somehow?
>
>> I thought about changing the snapshot handling of non-transactional
>> sequence changes to be similar to the transactional ones, but that
>> also won't work, because it is only at commit that we decide whether
>> we can send the changes.
>
> I wonder if there's some earlier LSN (similar to the consistent point)
> which might be useful for this.
>
> Or maybe we should queue even the non-transactional changes, not
> per-transaction but in a global list, and then at each commit either
> discard or inspect them (at that point we know the lowest LSN for all
> transactions and the consistent point). Seems complex, though.
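To make that "global list" idea a bit more concrete, here's a minimal C
sketch of the shape it might take. To be clear, this is just an
illustration under assumptions - PendingSequenceChange,
queue_nontransactional_change, flush_pending_seq_changes and
send_sequence_change are all made-up names; only the ilist and snapbuild
calls are real:

#include "postgres.h"

#include "access/xlogdefs.h"
#include "lib/ilist.h"
#include "replication/snapbuild.h"

/* decoded non-transactional sequence change, stashed until a commit */
typedef struct PendingSequenceChange
{
    XLogRecPtr  lsn;        /* LSN of the sequence WAL record */
    Oid         seqrelid;   /* which sequence was incremented */
    int64       last_value; /* decoded sequence state */
    dlist_node  node;
} PendingSequenceChange;

/* single global queue, naturally ordered by LSN (decoding order) */
static dlist_head pending_seq_changes =
    DLIST_STATIC_INIT(pending_seq_changes);

/* hypothetical hand-off to the output plugin, details elided */
static void
send_sequence_change(PendingSequenceChange *change)
{
}

/* called from sequence_decode() instead of sending immediately */
static void
queue_nontransactional_change(PendingSequenceChange *change)
{
    dlist_push_tail(&pending_seq_changes, &change->node);
}

/*
 * Called at each commit: by now we know the consistent point, so we
 * can decide which stashed changes to discard and which to send.
 */
static void
flush_pending_seq_changes(SnapBuild *builder, XLogRecPtr commit_lsn)
{
    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, &pending_seq_changes)
    {
        PendingSequenceChange *change =
            dlist_container(PendingSequenceChange, node, iter.cur);

        if (change->lsn > commit_lsn)
            break;              /* not covered by this commit yet */

        dlist_delete(&change->node);

        if (!SnapBuildXactNeedsSkip(builder, change->lsn))
            send_sequence_change(change);
        pfree(change);
    }
}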
>> For the transactional case, as we are considering the create sequence
>> operation as transactional, we would unnecessarily queue them even
>> though that is not required. Basically, they don't need to be
>> considered transactional and we can simply ignore such messages, like
>> other DDLs. But for that we probably need to distinguish the
>> Alter/Create case, which may or may not be straightforward. Now,
>> queuing them is probably harmless, unless it causes the transaction
>> to spill/stream.
>
> I'm not sure I follow. Why would we queue them unnecessarily?
>
> Also, there's the bug with decoding changes in transactions that
> create the sequence and add it to a publication. I think the agreement
> was that this behavior was incorrect, and that we should not decode
> changes until the subscription is refreshed. Doesn't that mean there
> can't be any CREATE case, just ALTER?

So, I investigated this a bit more, and I wrote a couple of
test_decoding isolation tests (patch attached) demonstrating the issue.
Actually, I should say "issues", because it's a bit worse than you
described ...

The whole problem is in this chunk of code in sequence_decode():

    /* Skip the change if already processed (per the snapshot). */
    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

    /* Queue the increment (or send immediately if not transactional). */
    snapshot = SnapBuildGetOrBuildSnapshot(builder, xid);
    ReorderBufferQueueSequence(ctx->reorder, xid, snapshot, buf->endptr,
                               origin_id, target_node, transactional,
                               xlrec->created, tuplebuf);

With the script you described, the increment is non-transactional, so
we end up in the second branch, return, and thus discard the increment.

But it's also possible the change is transactional, in which case only
the first branch could skip it. It does not, so we proceed to build the
snapshot. But the first thing SnapBuildGetOrBuildSnapshot does is

    Assert(builder->state == SNAPBUILD_CONSISTENT);

and we're still not in a consistent snapshot, so it just crashes and
burns :-(

The sequences.spec file has two definitions of the s2restart step, one
empty (resulting in a non-transactional change), one with ALTER
SEQUENCE (which makes the change transactional).

The really "funny" thing is that this is not new code - it's an exact
copy from logicalmsg_decode(), so logical messages have all these
issues too. We may discard some messages, trigger the same Assert, etc.
There's a messages2.spec demonstrating this (the s2message step defines
whether the message is transactional or not).

So I guess we need to fix both places, perhaps in a similar way. And
one of those fixes will have to be backpatched (which may make it more
complex).

The only option I see is reworking the decoding so that it does not
need the snapshot at all. We'd need to stash the changes just like any
other change, apply them at the end of the transaction, and the main
difference between the transactional and non-transactional case would
be what happens at abort: transactional changes would be discarded,
non-transactional ones would be applied anyway.

The challenge is that this reorders the sequence changes, so we'd need
to reconcile them somehow. One option would be to simply (1) apply the
change with the highest LSN in the transaction, and then (2) walk all
other in-progress transactions and discard changes for that sequence
with a lower LSN. Not sure how complex/expensive that would be, though.
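To illustrate what (1) and (2) might look like, here's a minimal
sketch, with everything hypothetical - SeqChange,
reconcile_sequence_changes, apply_seq_change and discard_seq_change are
made-up names, not actual reorderbuffer API:

#include "postgres.h"

#include "access/xlogdefs.h"
#include "nodes/pg_list.h"

/* decoded sequence change, queued in some transaction */
typedef struct SeqChange
{
    XLogRecPtr  lsn;        /* LSN of the WAL record for the change */
    Oid         seqrelid;   /* OID of the sequence */
    /* the decoded state (last_value etc.) would live here */
} SeqChange;

/* hypothetical hand-off to the output plugin, details elided */
static void
apply_seq_change(SeqChange *change)
{
}

/* hypothetical removal from the owning transaction's queue */
static void
discard_seq_change(SeqChange *change)
{
}

/*
 * At commit: apply only the change with the highest LSN, then prune
 * older changes to the same sequence still queued in other in-progress
 * transactions, so they can't be applied out of order later.
 */
static void
reconcile_sequence_changes(List *committed_changes,
                           List *other_txn_changes)
{
    SeqChange  *newest = NULL;
    ListCell   *lc;

    /* (1) find the change with the highest LSN in this transaction */
    foreach(lc, committed_changes)
    {
        SeqChange  *change = (SeqChange *) lfirst(lc);

        if (newest == NULL || change->lsn > newest->lsn)
            newest = change;
    }

    if (newest == NULL)
        return;

    apply_seq_change(newest);

    /* (2) discard older changes for the same sequence elsewhere */
    foreach(lc, other_txn_changes)
    {
        SeqChange  *change = (SeqChange *) lfirst(lc);

        if (change->seqrelid == newest->seqrelid &&
            change->lsn < newest->lsn)
            discard_seq_change(change);
    }
}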
Another problem is that not all increments are WAL-logged, of course,
and I'm not sure how that interacts with this approach.

Another option might be revisiting the approach proposed by Hannu in
September [1], i.e. tracking the sequences touched in a transaction,
and then replicating their current state at that particular moment.


regards

[1]
https://www.postgresql.org/message-id/CAMT0RQQeDR51xs8zTa25YpfKB1B34nS-Q4hhsRPznVsjMB_P1w%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company