Re: logical decoding and replication of sequences - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: logical decoding and replication of sequences |
Date | |
Msg-id | e6ef8862-31e4-1582-9274-f8f57b7f1cab@enterprisedb.com Whole thread Raw |
In response to | Re: logical decoding and replication of sequences (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses |
Re: logical decoding and replication of sequences
|
List | pgsql-hackers |
Hi, I've spent a bit of time exploring the alternative approach outlined by Hannu, i.e. tracking sequences accessed by the transaction, and logging the final state just once at COMMIT. Attached is an experimental version of the patch series doing that - 0001 does the original approach (decoding the sequence updates from WAL) and then 0002 reworks it to this alternative solution. The 0003 and 0004 stay mostly the same, except for minor fixes. Some of the tests in 0003/0004 fail, because 0002 changes the semantics in various ways (more about that later). The original approach (0001) may seem complex at first, but in principle it just decodes changes to the sequence relation, and either stashes them into transaction (just like other changes) or applies them right away. I'd say that's the most complicated part - deciding whether the change is transactional or not. 0002 reworks that so that it doesn't decode the existing WAL records, but tracks sequences which have been modified (updated on-disk state) and then accessed in the current transaction. And then at COMMIT time we write a new WAL message with info about the sequence. I realized we already cache sequences for each session - seqhashtab in sequence.c. It doesn't have any concept of a transaction, but it seems fairly easy to make that possible. I did this by adding two flags - needs_log - means the seesion advanced the sequence (on disk) - touched - true if the current xact called nextval() etc. The idea is that what matters is updates to on-disk state, so whenever we do that we set needs_log. But it only matters when the changes are made visible in a committed transaction. Consider for example this: BEGIN; SELECT nextval('s') FROM generate_series(1,10000) s(i); ROLLBACK; SELECT nextval('s'); The first nextval() call certainly sets both flags to true, at least for default sequences caching 32 values. But the values are not confirmed to the user because of the rollback - this resets 'touched' flag, but leaves 'needs_log' set to true. And then the next nextval() - which may easily be just from cache - sets touched=true again, and logs the sequence state at (implicit) commit. Which resets both flags again. The logging/cleanup happens in AtEOXact_Sequences() which gets called before commit/abort. This walks all cached sequences and writes the state for those with both flags true (or resets flag for abort). The cache also keeps info about the last "sequence state" in the session, which is then used when writing into into WAL. To write the sequence state into WAL, I've added a new WAL record xl_logical_sequence to RM_LOGICALMSG_ID, next to the xl_logical_message. It's a bit arbitrary, maybe it should be part of RM_SEQ_ID, but it does the trick. I don't think this is the main issue and it's easy enough to move it elsewhere if needed. So, that seems fairly straight-forward and it may reduce the number of replication messages for large transactions. Unfortunately, it's not much simpler compared to the first approach - the amount of code is about the same, and there's a bunch of other issues. The main issue seems to be about ordering. Consider multiple sessions all advancing the sequence. With the "old" approach this was naturally ordered - the order in which the increments were written to WAL made sense. But the sessions may advance the sequences in one order and then commit in a different order, which mixes the updates. Consider for example this scenario with two concurrent transactions: T1: nextval('s') -> allocates values [1,32] T2: nextval('s') -> allocates values [33,64] T2: commit -> logs [33,64] T1: commit -> logs [1,32] The result is the sequence on the replica diverted because it replayed the increments in the opposite order. I can think of two ways to fix this. Firstly, we could "merge" the increments in some smart way, e.g. by discarding values considered "stale" (like decrements). But that seems pretty fragile, because the sequence may be altered in various ways, reset, etc. And it seems more like transferring responsibility to someone else instead of actually solving the issue. The other fix is simply reading the current sequence state from disk at commit and logging that (instead of the values cached from the last increment). But I'm rather skeptical about doing such things right before COMMIT. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
pgsql-hackers by date: