Re: logical decoding and replication of sequences, take 2 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: logical decoding and replication of sequences, take 2
Date
Msg-id 20221117024357.ljjme6v75mny2j6u@awork3.anarazel.de
Whole thread Raw
In response to Re: logical decoding and replication of sequences, take 2  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: logical decoding and replication of sequences, take 2
List pgsql-hackers
Hi,


On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
> Well, yeah - we can either try to perform the stuff independently of the
> transactions that triggered it, or we can try making it part of some of
> the transactions. Each of those options has problems, though :-(
>
> The first version of the patch tried the first approach, i.e. decode the
> increments and apply that independently. But:
>
>   (a) What would you do with increments of sequences created/reset in a
>       transaction? Can't apply those outside the transaction, because it
>       might be rolled back (and that state is not visible on primary).

I think a reasonable approach could be to actually perform different WAL
logging for that case. It'll require a bit of machinery, but could actually
result in *less* WAL logging overall, because we don't need to emit a WAL
record for each SEQ_LOG_VALS sequence values.



>   (b) What about increments created before we have a proper snapshot?
>       There may be transactions dependent on the increment. This is what
>       ultimately led to revert of the patch.

I don't understand this - why would we ever need to process those increments
from before we have a snapshot?  Wouldn't they, by definition, be before the
slot was active?

To me this is the rough equivalent of logical decoding not giving the initial
state of all tables. You need some process outside of logical decoding to get
that (obviously we have some support for that via the exported data snapshot
during slot creation).

I assume that part of the initial sync would have to be a new sequence
synchronization step that reads all the sequence states on the publisher and
ensures that the subscriber sequences are at the same point. There's a bit of
trickiness there, but it seems entirely doable. The logical replication replay
support for sequences will have to be a bit careful about not decreasing the
subscriber's sequence values - the standby initially will be ahead of the
increments we'll see in the WAL. But that seems inevitable given the
non-transactional nature of sequences.



> This version of the patch tries to do the opposite thing - make sure
> that the state after each commit matches what the transaction might have
> seen (for sequences it accessed). It's imperfect, because it might log a
> state generated "after" the sequence got accessed - it focuses on the
> guarantee not to generate duplicate values.

That approach seems quite wrong to me.


> > I'm going to confess that I have no really specific idea how to
> > implement that. I'm just not sufficiently familiar with this code.
> > However, I suspect that the solution lies in changing things on the
> > decoding side rather than in the WAL format. I feel like the
> > information that we need in order to do the right thing must already
> > be present in the WAL. If it weren't, then how could crash recovery
> > work correctly, or physical replication? At any given moment, you can
> > choose to promote a physical standby, and at that point the state you
> > observe on the new primary had better be some state that existed on
> > the primary at some point in its history. At any moment, you can
> > unplug the primary, restart it, and run crash recovery, and if you do,
> > you had better end up with some state that existed on the primary at
> > some point shortly before the crash.

One minor exception here is that there's no real time bound to see the last
few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL
flush.


> > I think that there are actually a
> > few subtle inaccuracies in the last two sentences, because actually
> > the order in which transactions become visible on a physical standby
> > can differ from the order in which it happens on the primary, but I
> > don't think that actually changes the picture much. The point is that
> > the WAL is the definitive source of information about what happened
> > and in what order it happened, and we use it in that way already in
> > the context of physical replication, and of standbys. If logical
> > decoding has a problem with some case that those systems handle
> > correctly, the problem is with logical decoding, not the WAL format.
> >
>
> The problem lies in how we log sequences. If we wrote each individual
> increment to WAL, it might work the way you propose (except for cases
> with sequences created in a transaction, etc.). But that's not what we
> do - we log sequence increments in batches of 32 values, and then only
> modify the sequence relfilenode.

> This works for physical replication, because the WAL describes the
> "next" state of the sequence (so if you do "SELECT * FROM sequence"
> you'll not see the same state, and the sequence value may "jump ahead"
> after a failover).
>
> But for logical replication this does not work, because the transaction
> might depend on a state created (WAL-logged) by some other transaction.
> And perhaps that transaction actually happened *before* we even built
> the first snapshot for decoding :-/

I really can't follow the "depend on state ... by some other transaction"
aspect.


Even the case of a sequence that is renamed inside a transaction that did
*not* create / reset the sequence and then also triggers increment of the
sequence seems to be dealt with reasonably by processing sequence increments
outside a transaction - the old name will be used for the increments, replay
of the renaming transaction would then implement the rename in a hypothetical
DDL-replay future.


> There's also the issue with what snapshot to use when decoding these
> transactional changes in logical decoding (see

Incomplete parenthetical? Or were you referencing the next paragraph?

What are the transactional changes you're referring to here?


I did some skimming of the referenced thread about the reversal of the last
approach, but I couldn't really understand what the fundamental issues were
with the reverted implementation - it's a very long thread and references
other threads.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Meson add host_system to PG_VERSION_STR
Next
From: Michael Paquier
Date:
Subject: Re: User functions for building SCRAM secrets