From: Tomas Vondra
Subject: Re: logical decoding and replication of sequences, take 2
Msg-id: 29c4a08e-3610-f338-3e35-75882dcf759e@enterprisedb.com
In response to: Re: logical decoding and replication of sequences, take 2 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

On 11/17/22 03:43, Andres Freund wrote:
> Hi,
> 
> 
> On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
>> Well, yeah - we can either try to perform the stuff independently of the
>> transactions that triggered it, or we can try making it part of some of
>> the transactions. Each of those options has problems, though :-(
>>
>> The first version of the patch tried the first approach, i.e. decode the
>> increments and apply that independently. But:
>>
>>   (a) What would you do with increments of sequences created/reset in a
>>       transaction? Can't apply those outside the transaction, because it
>>       might be rolled back (and that state is not visible on primary).
> 
> I think a reasonable approach could be to actually perform different WAL
> logging for that case. It'll require a bit of machinery, but could actually
> result in *less* WAL logging overall, because we don't need to emit a WAL
> record for each SEQ_LOG_VALS sequence values.
> 

Could you elaborate? Hard to comment without knowing more ...

My point was that stuff like this (creating a new sequence, or at least
a new relfilenode) means we can't apply such increments independently of
the transaction (unlike regular increments). I'm not sure how a change
to WAL logging would make that go away.
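
For illustration (a hypothetical sequence "s", reset inside a
transaction - the increment hits a new relfilenode that only becomes
visible if the transaction commits):

    BEGIN;
    ALTER SEQUENCE s RESTART WITH 1000;  -- assigns a new relfilenode
    SELECT nextval('s');                 -- increment on the new relfilenode
    ROLLBACK;                            -- new relfilenode discarded

Applying that increment outside the transaction would advance state the
primary never made visible.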

> 
> 
>>   (b) What about increments created before we have a proper snapshot?
>>       There may be transactions dependent on the increment. This is what
>>       ultimately led to the revert of the patch.
> 
> I don't understand this - why would we ever need to process those increments
> from before we have a snapshot?  Wouldn't they, by definition, be before the
> slot was active?
> 
> To me this is the rough equivalent of logical decoding not giving the initial
> state of all tables. You need some process outside of logical decoding to get
> that (obviously we have some support for that via the exported data snapshot
> during slot creation).
> 

Which is what already happens during tablesync, no? We more or less copy
sequences as if they were tables.
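
Conceptually the sync is simple - read the current state of the
sequence on the publisher and apply it on the subscriber. A simplified
sketch of what fetch_sequence_data / copy_sequence amount to, modulo
the remote execution over the walsender connection:

    -- on the publisher: read the current sequence state
    SELECT last_value, log_cnt, is_called FROM s;

    -- on the subscriber: apply it
    SELECT setval('s', last_value, is_called);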

> I assume that part of the initial sync would have to be a new sequence
> synchronization step that reads all the sequence states on the publisher and
> ensures that the subscriber sequences are at the same point. There's a bit of
> trickiness there, but it seems entirely doable. The logical replication replay
> support for sequences will have to be a bit careful about not decreasing the
> subscriber's sequence values - the standby initially will be ahead of the
> increments we'll see in the WAL. But that seems inevitable given the
> non-transactional nature of sequences.
> 

See fetch_sequence_data / copy_sequence in the patch. The bit about
ensuring the sequence does not go backwards (say, using the page LSN
and/or the LSN of the increment) is not there, but isn't that pretty
much what I proposed doing for "reconciling" the sequence state logged
at COMMIT?
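
On the apply side that check might conceptually look like this (a
sketch only - the real thing would presumably compare LSNs rather than
values, and $1 stands for the replicated last_value):

    -- apply the replicated state only if it's ahead of the local state
    SELECT setval('s', GREATEST($1, (SELECT last_value FROM s)));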

> 
>> This version of the patch tries to do the opposite thing - make sure
>> that the state after each commit matches what the transaction might have
>> seen (for sequences it accessed). It's imperfect, because it might log a
>> state generated "after" the sequence got accessed - it focuses on the
>> guarantee not to generate duplicate values.
> 
> That approach seems quite wrong to me.
> 

Why? Because it might log the state of the sequence as of COMMIT, even
though the transaction accessed the sequence much earlier? That is,
this may happen:

T1: nextval('s') -> 1
T2: call nextval('s') 1000000x
T1: commit

and T1 will log a sequence state of ~1000001, give or take. I don't
think there's a way around that, given the non-transactional nature of
sequences. And I'm not convinced this is an issue - it still ensures
uniqueness of the values generated on the subscriber, and I think it's
reasonable to replicate the sequence state as of the commit (because
that's what you'd see on the primary).

> 
>>> I'm going to confess that I have no really specific idea how to
>>> implement that. I'm just not sufficiently familiar with this code.
>>> However, I suspect that the solution lies in changing things on the
>>> decoding side rather than in the WAL format. I feel like the
>>> information that we need in order to do the right thing must already
>>> be present in the WAL. If it weren't, then how could crash recovery
>>> work correctly, or physical replication? At any given moment, you can
>>> choose to promote a physical standby, and at that point the state you
>>> observe on the new primary had better be some state that existed on
>>> the primary at some point in its history. At any moment, you can
>>> unplug the primary, restart it, and run crash recovery, and if you do,
>>> you had better end up with some state that existed on the primary at
>>> some point shortly before the crash.
> 
> One minor exception here is that there's no real time bound to see the last
> few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL
> flush.
> 

Right. Another issue is that we ignore stuff that happened in aborted
transactions, so a nextval('s') in another transaction may rely on WAL
written by an aborted transaction and thus never wait for syncrep to
confirm receiving that WAL. Which is a data loss case, see [1]:

[1]
https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9%40enterprisedb.com
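
A minimal way to trigger it (assuming synchronous replication, and a
sequence "s" that has just exhausted its WAL-logged batch of values):

    -- session 1: writes the WAL record covering the next 32 values,
    -- but aborts, so it never waits for syncrep confirmation
    BEGIN;
    SELECT nextval('s');
    ROLLBACK;

    -- session 2: returns a value covered by that WAL record, generates
    -- no WAL of its own, and so doesn't wait for syncrep either
    SELECT nextval('s');

If the primary now fails over to a sync standby that never received
that WAL record, the standby can hand out the same value again.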

> 
>>> I think that there are actually a
>>> few subtle inaccuracies in the last two sentences, because actually
>>> the order in which transactions become visible on a physical standby
>>> can differ from the order in which it happens on the primary, but I
>>> don't think that actually changes the picture much. The point is that
>>> the WAL is the definitive source of information about what happened
>>> and in what order it happened, and we use it in that way already in
>>> the context of physical replication, and of standbys. If logical
>>> decoding has a problem with some case that those systems handle
>>> correctly, the problem is with logical decoding, not the WAL format.
>>>
>>
>> The problem lies in how we log sequences. If we wrote each individual
>> increment to WAL, it might work the way you propose (except for cases
>> with sequences created in a transaction, etc.). But that's not what we
>> do - we log sequence increments in batches of 32 values, and then only
>> modify the sequence relfilenode.
> 
>> This works for physical replication, because the WAL describes the
>> "next" state of the sequence (so if you do "SELECT * FROM sequence"
>> you'll not see the same state, and the sequence value may "jump ahead"
>> after a failover).
>>
>> But for logical replication this does not work, because the transaction
>> might depend on a state created (WAL-logged) by some other transaction.
>> And perhaps that transaction actually happened *before* we even built
>> the first snapshot for decoding :-/
> 
> I really can't follow the "depend on state ... by some other transaction"
> aspect.
> 

T1: nextval('s') -> writes WAL, covering the next 32 increments
T2: nextval('s') -> no WAL generated, covered by T1's WAL record

This is what I mean by a "dependency" on state logged by another
transaction. It already causes problems with streaming replication (see
the reference to syncrep above), and logical replication has the same
issue.
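
The batching is also why the sequence may jump ahead after a crash or
failover. Roughly (a freshly created sequence "s", with the default
SEQ_LOG_VALS of 32 - the exact values may be off by one):

    SELECT nextval('s');  -- returns 1, WAL-logs last_value = 33
    SELECT nextval('s');  -- returns 2, no WAL (covered by the record)
    -- crash + recovery
    SELECT nextval('s');  -- returns 34, the sequence jumped ahead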

> 
> Even the case of a sequence that is renamed inside a transaction that did
> *not* create / reset the sequence and then also triggers increment of the
> sequence seems to be dealt with reasonably by processing sequence increments
> outside a transaction - the old name will be used for the increments, replay
> of the renaming transaction would then implement the rename in a hypothetical
> DDL-replay future.
> 
> 
>> There's also the issue with what snapshot to use when decoding these
>> transactional changes in logical decoding (see
> 
> Incomplete parenthetical? Or were you referencing the next paragraph?
> 
> What are the transactional changes you're referring to here?
> 

Sorry - IIRC I merely wanted to mention the snapshot issue discussed in
the thread [2] that I ended up referencing in the next paragraph.


[2]
https://www.postgresql.org/message-id/00708727-d856-1886-48e3-811296c7ba8c%40enterprisedb.com

> 
> I did some skimming of the referenced thread about the reversal of the last
> approach, but I couldn't really understand what the fundamental issues were
> with the reverted implementation - it's a very long thread and references
> other threads.
> 

Yes, it's long/complex, but I intentionally linked to a specific message
which describes the issue ...

It's entirely possible there is a simple fix and I just got confused /
failed to see the solution. The whole issue was due to having a mix of
transactional and non-transactional cases, similar to logical messages
- and logicalmsg_decode() has the same issue, so let's talk about that
for a moment.

See [3] and imagine you're dealing with a transactional message while
still building a consistent snapshot. The first branch applies:

    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;

If the snapshot hasn't even reached SNAPBUILD_FULL_SNAPSHOT yet,
SnapBuildProcessChange bails out and the change is simply skipped:

    if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
        return false;

But if the snapshot is full and merely not consistent yet,
SnapBuildProcessChange returns true, so logicalmsg_decode() proceeds to

    snapshot = SnapBuildGetOrBuildSnapshot(builder);

which crashes, because it hits this assert:

    Assert(builder->state == SNAPBUILD_CONSISTENT);

The sequence decoding did almost the same thing, with the same issue.
Maybe the correct thing is to simply ignore the change in this case?
Presumably it'd be covered by the initial tablesync anyway. But we've
been unable to convince ourselves that's correct, or to decide what
snapshot to pass to ReorderBufferQueueMessage /
ReorderBufferQueueSequence.


[3]
https://github.com/postgres/postgres/blob/master/src/backend/replication/logical/decode.c#L585


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


