Re: logical decoding and replication of sequences - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: logical decoding and replication of sequences
Date
Msg-id 95345a19-d508-63d1-860a-f5c2f41e8d40@enterprisedb.com
Whole thread Raw
In response to Re: logical decoding and replication of sequences  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: logical decoding and replication of sequences
Re: logical decoding and replication of sequences
List pgsql-hackers

On 4/2/22 13:51, Tomas Vondra wrote:
> 
> 
> On 4/2/22 12:35, Amit Kapila wrote:
>> On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> On 3/28/22 07:29, Amit Kapila wrote:
>>>> I thought about changing snapshot dealing of
>>>> non-transactional sequence changes similar to transactional ones but
>>>> that also won't work because it is only at commit we decide whether we
>>>> can send the changes.
>>>>
>>> I wonder if there's some earlier LSN (similar to the consistent point)
>>> which might be useful for this.
>>>
>>> Or maybe we should queue even the non-transactional changes, not
>>> per-transaction but in a global list, and then at each commit either
>>> discard inspect them (at that point we know the lowest LSN for all
>>> transactions and the consistent point). Seems complex, though.
>>>
>>
>> I couldn't follow '..discard inspect them ..'. Do you mean we inspect
>> them and discard whichever are not required? It seems here we are
>> talking about a new global ReorderBufferGlobal instead of
>> ReorderBufferTXN to collect these changes but we don't need only
>> consistent point LSN because we do send if the commit of containing
>> transaction is after consistent point LSN, so we need some transaction
>> information as well. I think it could bring new challenges.
>>
> 
> Sorry for the gibberish. Yes, I meant to discard sequence changes that
> are no longer needed, due to being "obsoleted" by the applied change. We
> must not apply "older" changes (using LSN) because that would make the
> sequence go backwards.
> 
> I'm not entirely sure whether the list of changes should be kept in TXN
> or in the global reorderbuffer object - we need to track which TXN the
> change belongs to (because of transactional changes) but we also need to
> discard the unnecessary changes efficiently (and walking TXN might be
> expensive).
> 
> But yes, I'm sure there will be challenges. One being that tracking just
> the decoded WAL stuff is not enough, because nextval() may not generate
> WAL. But we still need to make sure the increment is replicated.
> 
> What I think we might do is this:
> 
> - add a global list of decoded sequence increments to ReorderBuffer
> 
> - at each commit/abort walk the list, walk the list and consider all
> increments up to the commit LSN that "match" (non-transactional match
> all TXNs, transactional match only the current TXN)
> 
> - replicate the last "matching" status for each sequence, discard the
> processed ones
> 
> We could probably optimize this by not tracking every single increment,
> but merge them "per transaction", I think.
> 
> I'm sure this description is pretty rough and will need refining, handle
> various corner-cases etc.
> 
I did some experiments over the weekend, exploring how to rework the
sequence decoding in various ways. Let me share some WIP patches,
hopefully that can be useful for trying more stuff and moving this
discussion forward.

I tried two things - (1) accumulating sequence increments in global
array and then doing something with it, and (2) treating all sequence
increments as regular changes (in a TXN) and then doing something
special during the replay. Attached are two patchsets, one for each
approach.

Note: It's important to remember decoding of sequences is not the only
code affected by this. The logical messages have the same issue,
certainly when it comes to transactional vs. non-transactional stuff and
handling of snapshots. Even if the sequence decoding ends up being
reverted, we still need to fix that, somehow. And my feeling is the
solutions ought to be pretty similar in both cases.

Now, regarding the two approaches:

(1) accumulating sequences in global hash table

The main problem with regular sequence increments is that those need to
be non-transactional - a transaction may use a sequence without any
WAL-logging, if the WAL was written by an earlier transaction. The
problem is the earlier trasaction might have been rolled back, and thus
simply discarded by the logical decoding. But we still need to apply
that, in order not to lose the sequence increment.

The current code just applies those non-transactional increments right
after decoding the increment, but that does not work because we may not
have a snapshot at that point. And we only have the snapshot when within
a transaction (AFAICS) so this queues all changes and then applies the
changes later.

The changes need to be shared by all transactions, so queueing them in a
global works fairly well - otherwise we'd have to walk all transactions,
in order to see if there are relevant sequence increments.

But some increments may be transactional, e.g. when the sequence is
created or altered in a transaction. To allow tracking this, this uses a
hash table, with relfilenode as a key.

There's a couple issues with this, though. Firstly, stashing the changes
outside transactions, it's not included in memory accounting, it's not
spilled to disk or streamed, etc. I guess fixing this is possible, but
it's certainly not straightforward, because we mix increments from many
different transactions.

A bigger issue is that I'm not sure this actually handles the snapshots
correctly either.

The non-transactional increments affect all transactions, so when
ReorderBufferProcessSequences gets executed, it processes all of them,
no matter the source transaction. Can we be sure the snapshot in the
applying transaction is the same (or "compatible") as the snapshot in
the source transaction?

Transactional increments can be simply processed as regular changes, of
course, but one difference is that we always create the transaction
(while before we just triggered the apply callback). This is necessary
as now we drive all of this from ReorderBufferCommit(), and without the
transaction the increment not be applied / there would be no snapshot.

It does seem to work, though, although I haven't tested it much so far.
One annoying bit is that we have to always walk all sequences and
increments, for each change in the transaction. Which seems quite
expensive, although the number of in-progress increments should be
pretty low (roughly equal to the number of sequences). Or at least the
part we need to consider for a single change (i.e. between two LSNs).

So maybe this should work. The one part this does not handle at all are
aborted transactions. At the moment we just discard those, which means
(a) we fail to discard the transactional changes from the hash table,
and (b) we can't apply the non-transactional changes, because with the
changes we also discard the snapshots we need.

I wonder if we could use a different snapshot, though. Non-transactional
changes can't change the relfilenode, after all. Not sure. If not, the
only solution I can think of is processing even aborted transactions,
but skipping changes except those that update snapshots.

There's a serious problem with streaming, though - we don't know which
transaction will commit first, hence we can't decide whether to send the
sequence changes. This seems pretty fatal to me. So we'd have to stream
the sequence changes only at commit, and then do some of this work on
the worker (i.e. merge the sequence changes to the right place). That
seems pretty annoying.


(2) treating sequence change as regular changes

This adopts a different approach - instead of accumulating the sequence
increments in a global hash table, it treats them as regular changes.
Which solves the snapshot issue, and issues with spilling to disk,
streaming and so on.

But it has various other issues with handling concurrent transactions,
unfortunately, which probably make this approach infeasible:

* The non-transactional stuff has to be applied in the first transaction
that commits, not in the transaction that generated the WAL. That does
not work too well with this approach, because we have to walk changes in
all other transactions.

* Another serious issue seems to be streaming - if we already streamed
some of the changes, we can't iterate through them anymore.

Also, having to walk the transactions over and over for each change, to
apply relevant sequence increments, that's mighty expensive. The other
approach needs to do that too, but walking the global hash table seems
much cheaper.

The other issue this handling of aborted transactions - we need to apply
sequence increments even from those transactions, of course. The other
approach has this issue too, though.


(3) tracking sequences touched by transaction

This is the approach proposed by Hannu Krosing. I haven't explored this
again yet, but I recall I wrote a PoC patch a couple months back.

It seems to me most of the problems stems from trying to derive sequence
state from decoded WAL changes, which is problematic because of the
non-transactional nature of sequences (i.e. WAL for one transaction
affects other transactions in non-obvious ways). And this approach
simply works around that entirely - instead of trying to deduce the
sequence state from WAL, we'd make sure to write the current sequence
state (or maybe just ID of the sequence) at commit time. Which should
eliminate most of the complexity / problems, I think.


I'm not really sure what to do about this. All of those reworks seems
like an extensive redesign of the patch, and considering the last CF is
already over ... not great.

However, even if we end up reverting this, we'll still have the same
problem with snapshots for logical messages.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: unlogged sequences
Next
From: Thomas Munro
Date:
Subject: Re: CLUSTER sort on abbreviated expressions is broken