Re: logical decoding and replication of sequences - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: logical decoding and replication of sequences
Date
Msg-id 00708727-d856-1886-48e3-811296c7ba8c@enterprisedb.com
In response to Re: logical decoding and replication of sequences  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: logical decoding and replication of sequences  (Amit Kapila <amit.kapila16@gmail.com>)
Re: logical decoding and replication of sequences  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On 4/1/22 17:02, Tomas Vondra wrote:
> 
> 
> On 3/28/22 07:29, Amit Kapila wrote:
>> ...
>>
>> While thinking about this, I think I see a problem with the
>> non-transactional handling of sequences. It seems that we will skip
>> sending non-transactional sequence change if it occurs before the
>> decoding has reached a consistent point but the surrounding commit
>> occurs after a consistent point is reached. In such cases, the
>> corresponding DMLs like inserts will be sent but sequence changes
>> won't be sent. For example (this scenario is based on
>> twophase_snapshot.spec),
>>
>> Initial setup:
>> ==============
>> Create table t1_seq(c1 int);
>> Create Sequence seq1;
>>
>> Test Execution via multiple sessions (this test allows insert in
>> session-2 to happen before we reach a consistent point and commit
>> happens after a consistent point):
>> =======================================================================================================
>>
>> Session-2:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-1:
>> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
>> 'test_decoding', false, true);
>>
>> Session-3:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-2:
>> Commit;
>> Begin;
>> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
>>
>> Session-3:
>> Commit;
>>
>> Session-2:
>> Commit 'foo'
>>
>> Session-1:
>> SELECT data  FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
>> 'include-xids', 'false', 'skip-empty-xacts', '1');
>>
>>                      data
>> ----------------------------------------------
>>  BEGIN
>>  table public.t1_seq: INSERT: c1[integer]:1
>>  table public.t1_seq: INSERT: c1[integer]:2
>>  table public.t1_seq: INSERT: c1[integer]:3
>>  table public.t1_seq: INSERT: c1[integer]:4
>>  table public.t1_seq: INSERT: c1[integer]:5
>>  table public.t1_seq: INSERT: c1[integer]:6
>>
>>
>> Now, if we normally try to decode such an insert, the result would be
>> something like:
>>                                      data
>> ------------------------------------------------------------------------------
>>  sequence public.seq1: transactional:0 last_value: 33 log_cnt: 0 is_called:1
>>  sequence public.seq1: transactional:0 last_value: 66 log_cnt: 0 is_called:1
>>  sequence public.seq1: transactional:0 last_value: 99 log_cnt: 0 is_called:1
>>  sequence public.seq1: transactional:0 last_value: 132 log_cnt: 0 is_called:1
>>  BEGIN
>>  table public.t1_seq: INSERT: c1[integer]:1
>>  table public.t1_seq: INSERT: c1[integer]:2
>>  table public.t1_seq: INSERT: c1[integer]:3
>>  table public.t1_seq: INSERT: c1[integer]:4
>>  table public.t1_seq: INSERT: c1[integer]:5
>>  table public.t1_seq: INSERT: c1[integer]:6
>>
>> This will create an inconsistent replica as sequence changes won't be
>> replicated.
> 
> Hmm, that's interesting. I wonder if it can actually happen, though.
> Have you been able to reproduce that, somehow?
> 
>> I thought about changing snapshot dealing of
>> non-transactional sequence changes similar to transactional ones but
>> that also won't work because it is only at commit we decide whether we
>> can send the changes.
>>
> I wonder if there's some earlier LSN (similar to the consistent point)
> which might be useful for this.
> 
> Or maybe we should queue even the non-transactional changes, not
> per-transaction but in a global list, and then at each commit either
> discard or inspect them (at that point we know the lowest LSN for all
> transactions and the consistent point). Seems complex, though.
> 
>> For the transactional case, as we are considering the create sequence
>> operation as transactional, we would unnecessarily queue them even
>> though that is not required. Basically, they don't need to be
>> considered transactional and we can simply ignore such messages like
>> other DDLs. But for that probably we need to distinguish Alter/Create
>> case which may or may not be straightforward. Now, queuing them is
>> probably harmless unless it causes the transaction to spill/stream.
>>
> 
> I'm not sure I follow. Why would we queue them unnecessarily?
> 
> Also, there's the bug with decoding changes in transactions that create
> the sequence and add it to a publication. I think the agreement was that
> this behavior was incorrect, we should not decode changes until the
> subscription is refreshed. Doesn't that mean there can't be any CREATE
> case, just ALTER?
> 

So, I investigated this a bit more, and I wrote a couple test_decoding
isolation tests (patch attached) demonstrating the issue. Actually, I
should say "issues" because it's a bit worse than you described ...

The whole problem is in this chunk of code in sequence_decode():


  /* Skip the change if already processed (per the snapshot). */
  if (transactional &&
      !SnapBuildProcessChange(builder, xid, buf->origptr))
      return;
  else if (!transactional &&
           (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
            SnapBuildXactNeedsSkip(builder, buf->origptr)))
      return;

  /* Queue the increment (or send immediately if not transactional). */
  snapshot = SnapBuildGetOrBuildSnapshot(builder, xid);
  ReorderBufferQueueSequence(ctx->reorder, xid, snapshot, buf->endptr,
                             origin_id, target_node, transactional,
                             xlrec->created, tuplebuf);

With the script you described, the increment is non-transactional, so we
end up in the second branch, return early, and thus discard the increment.

But it's also possible the change is transactional, in which case only
the first branch can skip it. It does not, so we proceed to build the
snapshot. But the first thing SnapBuildGetOrBuildSnapshot does is

   Assert(builder->state == SNAPBUILD_CONSISTENT);

and as we still don't have a consistent snapshot, it just crashes and
burns :-(

The sequences.spec file has two definitions of the s2restart step: one
empty (resulting in a non-transactional change), one with ALTER SEQUENCE
(which makes the change transactional).


The really "funny" thing is this is not new code - it's an exact copy
of the logic in logicalmsg_decode(), so logical messages have all those
issues too: we may discard some messages, trigger the same Assert, etc.
There's a messages2.spec demonstrating this (the s2message step defines
whether the message is transactional or not).

So I guess we need to fix both places, perhaps in a similar way. And one
of those will have to be backpatched (which may make it more complex).


The only option I see is reworking the decoding so that it does not
need the snapshot at all. We'd need to stash the changes just like any
other change and apply them at the end of the transaction; the main
difference between the transactional and non-transactional cases would
be what happens at abort: transactional changes would be discarded,
non-transactional ones would be applied anyway.

The challenge is that this reorders the sequence changes, so we'll need
to reconcile them somehow. One option would be to simply apply the
change with the highest LSN in the transaction, and then walk all other
in-progress transactions and discard changes for that sequence with a
lower LSN. Not sure how complex/expensive that would be, though. Another
problem is that not all increments are WAL-logged, of course; not sure
how to deal with that.

Another option might be revisiting the approach proposed by Hannu in
September [1], i.e. tracking sequences touched in a transaction, and
then replicating the current state at that particular moment.



regards


[1]
https://www.postgresql.org/message-id/CAMT0RQQeDR51xs8zTa25YpfKB1B34nS-Q4hhsRPznVsjMB_P1w%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
