Re: logical decoding and replication of sequences, take 2 - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: logical decoding and replication of sequences, take 2 |
Date | |
Msg-id | 195be236-7580-72c6-97c7-59f06c4e093c@enterprisedb.com Whole thread Raw |
In response to | Re: logical decoding and replication of sequences, take 2 (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: logical decoding and replication of sequences, take 2
|
List | pgsql-hackers |
On 11/28/23 12:32, Amit Kapila wrote: > On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I spent a bit of time looking at the proposed change, and unfortunately >> logging just the boolean flag does not work. A good example is this bit >> from a TAP test added by the patch for built-in replication (which was >> not included with the WIP patch): >> >> BEGIN; >> ALTER SEQUENCE s RESTART WITH 1000; >> SAVEPOINT sp1; >> INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100); >> ROLLBACK TO sp1; >> COMMIT; >> >> This is expected to produce: >> >> 1131|0|t >> >> but produces >> >> 1000|0|f >> >> instead. The reason is very simple - as implemented, the patch simply >> checks if the relfilenode is from the same top-level transaction, which >> it is, and sets the flag to "true". So we know the sequence changes need >> to be queued and replayed as part of this transaction. >> >> But then during decoding, we still queue the changes into the subxact, >> which then aborts, and the changes are discarded. That is not how it's >> supposed to work, because the new relfilenode is still valid, someone >> might do nextval() and commit. And the nextval() may not get WAL-logged, >> so we'd lose this. >> >> What I guess we might do is log not just a boolean flag, but the XID of >> the subtransaction that created the relfilenode. And then during >> decoding we'd queue the changes into this subtransaction ... >> >> 0006 in the attached patch series does this, and it seems to fix the TAP >> test failure. I left it at the end, to make it easier to run tests >> without the patch applied. >> > > Offhand, I don't have any better idea than what you have suggested for > the problem but this needs some thoughts including the questions asked > by you. I'll spend some time on it and respond back. > I've been experimenting with the idea to log the XID, and for a moment I was worried it actually can't work, because subtransactions may not actually be just nested in simple way, but form a tree. And what if the sequence was altered in a different branch (sibling subxact), not in the immediate parent. In which case the new SubTransactionGetXid() would fail, because it just walks the current chain of subtransactions. I've been thinking about cases like this: BEGIN; CREATE SEQUENCE s; # XID 1000 SELECT alter_sequence(); # XID 1001 SAVEPOINT s1; SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000 ROLLBACK TO s1; SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000 COMMIT; The XID values are what the sequence wal record will reference, assuming that the main transaction XID is 1000. Initially, I thought it's wrong that the nextval() calls reference XID of the main transaction, because the last relfilenode comes from 1001, which is the subxact created by alter_sequence() thanks to the exception handling block. And that's where the approach in reorderbuffer would queue the changes. But I think this is actually correct too. When a subtransaction commits (e.g. when alter_sequence() completes), it essentially becomes part of the parent. And AtEOSubXact_cleanup() updates rd_newRelfilelocatorSubid accordingly, setting it to parentSubid. This also means that SubTransactionGetXid() can't actually fail, because the ID has to reference an active subtransaction in the current stack. I'm still concerned about the cost of the lookup, because the list may be long and the subxact we're looking for may be quite high, but I guess we might have another field, caching the XID. It'd need to be updated only in AtEOSubXact_cleanup, and at that point we know it's the immediate parent, so it'd be pretty cheap I think. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: