Re: logical decoding and replication of sequences - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: logical decoding and replication of sequences
Msg-id: 1882b44e-8abd-8f2f-29a3-21453801db15@enterprisedb.com
In response to: Re: logical decoding and replication of sequences (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: logical decoding and replication of sequences (Noah Misch <noah@leadboat.com>)
List: pgsql-hackers
On 4/6/22 16:13, Tomas Vondra wrote:
>
> On 4/5/22 12:06, Amit Kapila wrote:
>> On Mon, Apr 4, 2022 at 3:10 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> I did some experiments over the weekend, exploring how to rework the
>>> sequence decoding in various ways. Let me share some WIP patches,
>>> hopefully they can be useful for trying more stuff and moving this
>>> discussion forward.
>>>
>>> I tried two things - (1) accumulating sequence increments in a global
>>> array and then doing something with it, and (2) treating all sequence
>>> increments as regular changes (in a TXN) and then doing something
>>> special during the replay. Attached are two patchsets, one for each
>>> approach.
>>>
>>> Note: It's important to remember that decoding of sequences is not
>>> the only code affected by this. Logical messages have the same issue,
>>> certainly when it comes to transactional vs. non-transactional stuff
>>> and handling of snapshots. Even if the sequence decoding ends up
>>> being reverted, we still need to fix that, somehow. And my feeling is
>>> the solutions ought to be pretty similar in both cases.
>>>
>>> Now, regarding the two approaches:
>>>
>>> (1) accumulating sequences in a global hash table
>>>
>>> The main problem with regular sequence increments is that those need
>>> to be non-transactional - a transaction may use a sequence without
>>> any WAL-logging, if the WAL was written by an earlier transaction.
>>> The problem is that the earlier transaction might have been rolled
>>> back, and thus simply discarded by the logical decoding. But we still
>>> need to apply that increment, in order not to lose it.
>>>
>>> The current code applies those non-transactional increments right
>>> after decoding the increment, but that does not work because we may
>>> not have a snapshot at that point. And we only have a snapshot when
>>> within a transaction (AFAICS), so this patch queues all changes and
>>> then applies them later.
>>>
>>> The changes need to be shared by all transactions, so queueing them
>>> in a global structure works fairly well - otherwise we'd have to walk
>>> all transactions, in order to see if there are relevant sequence
>>> increments.
>>>
>>> But some increments may be transactional, e.g. when the sequence is
>>> created or altered in a transaction. To allow tracking this, this
>>> uses a hash table, with the relfilenode as a key.
>>>
>>> There are a couple of issues with this, though. Firstly, with the
>>> changes stashed outside transactions, they are not included in memory
>>> accounting, not spilled to disk or streamed, etc. I guess fixing this
>>> is possible, but it's certainly not straightforward, because we mix
>>> increments from many different transactions.
>>>
>>> A bigger issue is that I'm not sure this actually handles the
>>> snapshots correctly either.
>>>
>>> The non-transactional increments affect all transactions, so when
>>> ReorderBufferProcessSequences gets executed, it processes all of
>>> them, no matter the source transaction. Can we be sure the snapshot
>>> in the applying transaction is the same (or "compatible") as the
>>> snapshot in the source transaction?
>>>
>>
>> I don't think we can assume that. I think it is possible that some
>> other transaction's WAL can be in-between the start/end LSN of a txn
>> (which we decide to send) which may not finally reach a consistent
>> state. Consider a case similar to the one shown in one of my previous
>> emails:
>>
>> Session-2:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-1:
>> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
>> 'test_decoding', false, true);
>>
>> Session-3:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-2:
>> Commit;
>> Begin;
>> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
>>
>> Session-3:
>> Commit;
>>
>> Session-2:
>> Commit;
>>
>> Here, we send changes (say the insert from txn 700) from session-2
>> because session-3's commit happens before it.
>> Now, consider another
>> transaction parallel to txn 700 which generates some WAL related to
>> sequences, but committed before session-3's commit. So although its
>> changes will be in-between the start/end LSN of txn 700, they
>> shouldn't be sent.
>>
>> I have not tried this, and it may be solvable in some way, but I
>> think processing changes from other TXNs sounds risky to me in terms
>> of snapshot handling.
>>
>
> Yes, I know this can happen. I was only really thinking about what
> might happen to the relfilenode of the sequence itself - and I don't
> think any concurrent transaction could swoop in and change the
> relfilenode in any meaningful way, due to locking.
>
> But of course, if we expect/require a perfect snapshot for that exact
> position in the transaction, this won't work. IMO the whole idea that
> we can have non-transactional bits in naturally transactional decoding
> seems a bit suspicious (at least in hindsight).
>
> No matter what we do for sequences, though, this still affects logical
> messages too. Not sure what to do there :-(
>
>>>
>>>
>>> (2) treating sequence changes as regular changes
>>>
>>> This adopts a different approach - instead of accumulating the
>>> sequence increments in a global hash table, it treats them as regular
>>> changes. Which solves the snapshot issue, and the issues with
>>> spilling to disk, streaming and so on.
>>>
>>> But it has various other issues with handling concurrent
>>> transactions, unfortunately, which probably make this approach
>>> infeasible:
>>>
>>> * The non-transactional stuff has to be applied in the first
>>> transaction that commits, not in the transaction that generated the
>>> WAL. That does not work too well with this approach, because we have
>>> to walk the changes in all other transactions.
>>>
>>
>> Why do you want to traverse other TXNs in this approach?
>> Is it because
>> the current TXN might be using some value of a sequence which has
>> actually been WAL-logged in the other transaction, but that other
>> transaction has not been sent yet? I think if we don't send that,
>> then the sequence columns (in some tables) on the replica may have
>> some values, but the sequence itself won't have reached that value
>> yet, which sounds problematic. Is that correct?
>>
>
> Well, how else would you get to the sequence changes in the other
> TXNs?
>
> Consider this:
>
> T1: begin
> T2: begin
>
> T2: nextval('s') -> writes WAL for 32 values
> T1: nextval('s') -> gets value without WAL
>
> T1: commit
> T2: commit
>
> Now, if we commit T1 without "applying" the sequence change from T2,
> we lose the sequence state. But we still write/replicate the value
> generated from the sequence.
>
>>> * Another serious issue seems to be streaming - if we already
>>> streamed some of the changes, we can't iterate through them anymore.
>>>
>>> Also, having to walk the transactions over and over for each change,
>>> to apply the relevant sequence increments, is mighty expensive. The
>>> other approach needs to do that too, but walking the global hash
>>> table seems much cheaper.
>>>
>>> The other issue is the handling of aborted transactions - we need to
>>> apply sequence increments even from those transactions, of course.
>>> The other approach has this issue too, though.
>>>
>>>
>>> (3) tracking sequences touched by a transaction
>>>
>>> This is the approach proposed by Hannu Krosing. I haven't explored
>>> this again yet, but I recall I wrote a PoC patch a couple of months
>>> back.
>>>
>>> It seems to me most of the problems stem from trying to derive the
>>> sequence state from decoded WAL changes, which is problematic
>>> because of the non-transactional nature of sequences (i.e. WAL for
>>> one transaction affects other transactions in non-obvious ways).
>>> And this approach simply works around that entirely - instead of
>>> trying to deduce the sequence state from WAL, we'd make sure to
>>> write the current sequence state (or maybe just the ID of the
>>> sequence) at commit time. Which should eliminate most of the
>>> complexity / problems, I think.
>>>
>>
>> That sounds promising, but I haven't thought in detail about that
>> approach.
>>
>
> So, here's a patch doing that. It's a reworked/improved version of the
> patch [1] shared in November.
>
> It seems to be working pretty nicely. The behavior is a little bit
> different, of course, because we only replicate "committed" changes,
> so if you do nextval() in an aborted transaction, that is not
> replicated. Which I think is fine, because we make no durability
> guarantees for aborted transactions in general.
>
> But there are a couple of issues too:
>
> 1) locking
>
> We have to read the sequence state before the commit, but we must not
> allow reordering (because then the state might go backwards again).
> I'm not sure how serious an impact this could have on performance.
>
> 2) dropped sequences
>
> I'm not sure what to do about sequences dropped in the transaction.
> The patch simply attempts to read the current sequence state before
> the commit, but if the sequence was dropped (in that transaction),
> that can't happen. I'm not sure if that's OK or not.
>
> 3) WAL record
>
> To replicate the state, the patch uses a LogicalMessage, but I guess a
> separate WAL record would be better. But that's a technical detail.
>
>
> regards
>
> [1]
> https://www.postgresql.org/message-id/2cd38bab-c874-8e0b-98e7-d9abaaf9806a@enterprisedb.com
>
>>>
>>> I'm not really sure what to do about this. All of those reworks seem
>>> like an extensive redesign of the patch, and considering the last CF
>>> is already over ... not great.
>>>
>>
>> Yeah, I share the same feeling that even if we devise solutions to
>> all the known problems, it requires quite some time to ensure
>> everything is correct.
>>
>
> True. Let's keep working on this for a bit more time and then we can
> decide what to do.
>

I've pushed a revert of all the commits related to this - decoding of
sequences and the test_decoding / built-in replication changes.

The approach combining transactional and non-transactional behavior
implemented by the patch clearly has issues, and it seems foolish to
hope we'll find a simple fix. So the changes would have to be much more
extensive, and doing that after the last CF seems like an obviously bad
idea.

Attached is a rebased patch, implementing the approach based on
WAL-logging sequences at commit time.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company