Re: logical decoding and replication of sequences, take 2 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: logical decoding and replication of sequences, take 2
Date
Msg-id 110816d7-ebe1-3612-8ca7-d38489dd8e22@enterprisedb.com
Whole thread Raw
In response to Re: logical decoding and replication of sequences, take 2  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: logical decoding and replication of sequences, take 2
Re: logical decoding and replication of sequences, take 2
List pgsql-hackers

On 11/16/22 22:05, Robert Haas wrote:
> On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> The other option might be to make these messages non-transactional, in
>> which case we'd separate the ordering from COMMIT ordering, evading the
>> reordering problem.
>>
>> That'd mean we'd ignore rollbacks (which seems fine), we could probably
>> optimize this by checking if the state actually changed, etc. But we'd
>> also need to deal with transactions created in the (still uncommitted)
>> transaction. But I'm also worried it might lead to the same issue with
>> non-transactional behaviors that forced revert in v15.
> 
> I think it might be a good idea to step back slightly from
> implementation details and try to agree on a theoretical model of
> what's happening here. Let's start by banishing the words
> transactional and non-transactional from the conversation and talk
> about what logical replication is trying to do.
> 

OK, let's try.

> We can imagine that the replicated objects on the primary pass through
> a series of states S1, S2, ..., Sn, where n keeps going up as new
> state changes occur. The state, for our purposes here, is the contents
> of the database as they could be observed by a user running SELECT
> queries at some moment in time chosen by the user. For instance, if
> the initial state of the database is S1, and then the user executes
> BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the
> state that differs from S1 in that both of those rows are now part of
> the database contents. There is no state where one of those rows is
> visible and the other is not. That was never observable by the user,
> except from within the transaction as it was executing, which we can
> and should discount. I believe that the goal of logical replication is
> to bring about a state of affairs where the set of states observable
> on the standby is a subset of the states observable on the primary.
> That is, if the primary goes from S1 to S2 to S3, the standby can do
> the same thing, or it can go straight from S1 to S3 without ever
> making it possible for the user to observe S2. Either is correct
> behavior. But the standby cannot invent any new states that didn't
> occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to
> S3, or something like that. It can only consolidate changes that
> occurred separately on the primary, never split them up. Neither can
> it reorder them.
> 

I mostly agree, and in a way the last patch aims to do roughly this,
i.e. make sure that the state after each transaction matches the state a
user might observe on the primary (modulo implementation challenges).

There's a couple of caveats, though:

1) Maybe we should focus more on "actually observed" state instead of
"observable". Who cares if the sequence moved forward in a transaction
that was ultimately rolled back? No committed transaction should have
observer those values - in a way, the last "valid" state of the sequence
is the last value generated in a transaction that ultimately committed.

2) I think what matters more is that we never generate duplicate value.
That is, if you generate a value from a sequence, commit a transaction
and replicate it, then the logical standby should not generate the same
value from the sequence. This guarantee seems necessary for "failover"
to logical standby.

> Now, if you accept this as a reasonable definition of correctness,
> then the next question is what consequences it has for transactional
> and non-transactional behavior. If all behavior is transactional, then
> we've basically got to replay each primary transaction in a single
> standby transaction, and commit those transactions in the same order
> that the corresponding primary transactions committed. We could
> legally choose to merge a group of transactions that committed one
> after the other on the primary into a single transaction on the
> standby, and it might even be a good idea if they're all very tiny,
> but it's not required. But if there are non-transactional things
> happening, then there are changes that become visible at some time
> other than at a transaction commit. For example, consider this
> sequence of events, in which each "thing" that happens is
> transactional except where the contrary is noted:
> 
> T1: BEGIN;
> T2: BEGIN;
> T1: Do thing 1;
> T2: Do thing 2;
> T1: Do a non-transactional thing;
> T1: Do thing 3;
> T2: Do thing 4;
> T2: COMMIT;
> T1: COMMIT;
> 
> From the point of the user here, there are 4 observable states here:
> 
> S1: Initiate state.
> S2: State after the non-transactional thing happens.
> S3: State after T2 commits (reflects the non-transactional thing plus
> things 2 and 4).
> S4: State after T1 commits.
> 
> Basically, the non-transactional thing behaves a whole lot like a
> separate transaction. That non-transactional operation ought to be
> replicated before T2, which ought to be replicated before T1. Maybe
> logical replication ought to treat it in exactly that way: as a
> separate operation that needs to be replicated after any earlier
> transactions that completed prior to the history shown here, but
> before T2 or T1. Alternatively, you can merge the non-transactional
> change into T2, i.e. the first transaction that committed after it
> happened. But you can't merge it into T1, even though it happened in
> T1. If you do that, then you're creating states on the standby that
> never existed on the primary, which is wrong. You could argue that
> this is just nitpicking: who cares if the change in the sequence value
> doesn't get replicated at exactly the right moment? But I don't think
> it's a technicality at all: I think if we don't make the operation
> appear to happen at the same point in the sequence as it became
> visible on the master, then there will be endless artifacts and corner
> cases to the bottom of which we will never get. Just like if we
> replicated the actual transactions out of order, chaos would ensue,
> because there can be logical dependencies between them, so too can
> there be logical dependencies between non-transactional operations, or
> between a non-transactional operation and a transactional operation.
> 

Well, yeah - we can either try to perform the stuff independently of the
transactions that triggered it, or we can try making it part of some of
the transactions. Each of those options has problems, though :-(

The first version of the patch tried the first approach, i.e. decode the
increments and apply that independently. But:

  (a) What would you do with increments of sequences created/reset in a
      transaction? Can't apply those outside the transaction, because it
      might be rolled back (and that state is not visible on primary).

  (b) What about increments created before we have a proper snapshot?
      There may be transactions dependent on the increment. This is what
      ultimately led to revert of the patch.

This version of the patch tries to do the opposite thing - make sure
that the state after each commit matches what the transaction might have
seen (for sequences it accessed). It's imperfect, because it might log a
state generated "after" the sequence got accessed - it focuses on the
guarantee not to generate duplicate values.

> To make it more concrete, consider two sessions concurrently running this SQL:
> 
> insert into t1 select nextval('s1') from generate_series(1,1000000) g;
> 
> There are, in effect, 2000002 transaction-like things here. The
> sequence gets incremented 2 million times, and then there are 2
> commits that each insert a million rows. Perhaps the actual order of
> events looks something like this:
> 
> 1. nextval the sequence N times, where N >= 1 million
> 2. commit the first transaction, adding a million rows to t1
> 3. nextval the sequence 2 million - N times
> 4. commit the second transaction, adding another million rows to t1
> 
> Unless we replicate all of the nextval operations that occur in step 1
> at the same time or prior to replicating the first transaction in step
> 2, we might end up making visible a state where the next value of the
> sequence is less than the highest value present in the table, which
> would be bad.
> 

Right, that's the "guarantee" I've mentioned above, more or less.

> With that perhaps overly-long set of preliminaries, I'm going to move
> on to talking about the implementation ideas which you mention. You
> write that "the current solution is based on WAL-logging state of all
> sequences incremented by the transaction at COMMIT" and then, it seems
> to me, go on to demonstrate that it's simply incorrect. In my opinion,
> the fundamental problem is that it doesn't look at the order that
> things happened on the primary and do them in the same order on the
> standby. Instead, it accepts that the non-transactional operations are
> going to be replicated at the wrong time, and then tries to patch
> around the issue by attempting to scrounge up the correct values at
> some convenient point and use that data to compensate for our failure
> to do the right thing at an earlier point. That doesn't seem like a
> satisfying solution, and I think it will be hard to make it fully
> correct.
> 

I understand what you're saying, but I'm not sure I agree with you.

Yes, this would mean we accept we may end up with something like this:

1: T1 logs sequence state S1
2: someone increments sequence
3: T2 logs sequence stats S2
4: T2 commits
5: T1 commits

which "inverts" the apply order of S1 vs. S2, because we first apply S2
and then the "old" S1. But as long as we're smart enough to "discard"
applying S1, I think that's acceptable - because it guarantees we'll not
generate duplicate values (with values in the committed transaction).

I'd also argue it does not actually generate invalid state, because once
we commit either transaction, S2 is what's visible.

Yes, if you so "SELECT * FROM sequence" you'll see some intermediate
state, but that's not how sequences are accessed. And you can't do
currval('s') from a transaction that never accessed the sequence.

And if it did, we'd write S2 (or whatever it saw) as part of it's commits.

So I think the main issue of this approach is how to decide which
sequence states are obsolete and should be skipped.

> Your alternative proposal says "The other option might be to make
> these messages non-transactional, in which case we'd separate the
> ordering from COMMIT ordering, evading the reordering problem." But I
> don't think that avoids the reordering problem at all.

I don't understand why. Why would it not address the reordering issue?

> Nor do I think it's correct.

Nor do I understand this. I mean, isn't it essentially the option you
mentioned earlier - treating the non-transactional actions as
independent transactions? Yes, we'd be batching them so that we'd not
see "intermediate" states, but those are not observed by abyone.

> I don't think you *can* separate the ordering of these
> operations from the COMMIT ordering. They are, as I argue here,
> essentially mini-commits that only bump the sequence value, and they
> need to be replicated after the transactions that commit prior to the
> sequence value bump and before those that commit afterward. If they
> aren't handled that way, I don't think you're going to get fully
> correct behavior.

I'm confused. Isn't that pretty much exactly what I'm proposing? Imagine
you have something like this:

1: T1 does something and also increments a sequence
2: T1 logs state of the sequence (right before commit)
3: T1 writes COMMIT

Now when we decode/apply this, we end up doing this:

1: decode all T1 changes, stash them
2: decode the sequence state and apply it separately
3: decode COMMIT, apply all T1 changes

There might be other transactions interleaving with this, but I think
it'd behave correctly. What example would not work?

> 
> I'm going to confess that I have no really specific idea how to
> implement that. I'm just not sufficiently familiar with this code.
> However, I suspect that the solution lies in changing things on the
> decoding side rather than in the WAL format. I feel like the
> information that we need in order to do the right thing must already
> be present in the WAL. If it weren't, then how could crash recovery
> work correctly, or physical replication? At any given moment, you can
> choose to promote a physical standby, and at that point the state you
> observe on the new primary had better be some state that existed on
> the primary at some point in its history. At any moment, you can
> unplug the primary, restart it, and run crash recovery, and if you do,
> you had better end up with some state that existed on the primary at
> some point shortly before the crash. I think that there are actually a
> few subtle inaccuracies in the last two sentences, because actually
> the order in which transactions become visible on a physical standby
> can differ from the order in which it happens on the primary, but I
> don't think that actually changes the picture much. The point is that
> the WAL is the definitive source of information about what happened
> and in what order it happened, and we use it in that way already in
> the context of physical replication, and of standbys. If logical
> decoding has a problem with some case that those systems handle
> correctly, the problem is with logical decoding, not the WAL format.
> 

The problem lies in how we log sequences. If we wrote each individual
increment to WAL, it might work the way you propose (except for cases
with sequences created in a transaction, etc.). But that's not what we
do - we log sequence increments in batches of 32 values, and then only
modify the sequence relfilenode.

This works for physical replication, because the WAL describes the
"next" state of the sequence (so if you do "SELECT * FROM sequence"
you'll not see the same state, and the sequence value may "jump ahead"
after a failover).

But for logical replication this does not work, because the transaction
might depend on a state created (WAL-logged) by some other transaction.
And perhaps that transaction actually happened *before* we even built
the first snapshot for decoding :-/

There's also the issue with what snapshot to use when decoding these
transactional changes in logical decoding (see


> In particular, I think it's likely that the "non-transactional
> messages" that you mention earlier don't get applied at the point in
> the commit sequence where they were found in the WAL. Not sure why
> exactly, but perhaps the point at which we're reading WAL runs ahead
> of the decoding per se, or something like that, and thus those
> non-transactional messages arrive too early relative to the commit
> ordering. Possibly that could be changed, and they could be buffered

I'm not sure which case of "non-transactional messages" this refers to,
so I can't quite respond to these comments. Perhaps you mean the
problems that killed the previous patch [1]?

[1]
https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com


> until earlier commits are replicated. Or else, when we see a WAL
> record for a non-transactional sequence operation, we could arrange to
> bundle that operation into an "adjacent" replicated transaction i.e.

IIRC moving stuff between transactions during decoding is problematic,
because of snapshots.

> the transaction whose commit record occurs most nearly prior to, or
> most nearly after, the WAL record for the operation itself. Or else,
> we could create "virtual" transactions for such operations and make
> sure those get replayed at the right point in the commit sequence. Or
> else, I don't know, maybe something else. But I think the overall
> picture is that we need to approach the problem by replicating changes
> in WAL order, as a physical standby would do. Saying that a change is
> "nontransactional" doesn't mean that it's exempt from ordering
> requirements; rather, it means that that change has its own place in
> that ordering, distinct from the transaction in which it occurred.
> 

But doesn't the approach with WAL-logging sequence state before COMMIT,
and then applying it independently in WAL-order, do pretty much this?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Richard Guo
Date:
Subject: Re: Fix the README file for MERGE command
Next
From: Andres Freund
Date:
Subject: ubsan fails on 32bit builds