Re: logical decoding and replication of sequences - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: logical decoding and replication of sequences
Date
Msg-id e6ef8862-31e4-1582-9274-f8f57b7f1cab@enterprisedb.com
Whole thread Raw
In response to Re: logical decoding and replication of sequences  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: logical decoding and replication of sequences
List pgsql-hackers
Hi,

I've spent a bit of time exploring the alternative approach outlined by 
Hannu, i.e. tracking sequences accessed by the transaction, and logging 
the final state just once at COMMIT. Attached is an experimental version 
of the patch series doing that - 0001 does the original approach 
(decoding the sequence updates from WAL) and then 0002 reworks it to 
this alternative solution. The 0003 and 0004 stay mostly the same, 
except for minor fixes. Some of the tests in 0003/0004 fail, because 
0002 changes the semantics in various ways (more about that later).

The original approach (0001) may seem complex at first, but in principle 
it just  decodes changes to the sequence relation, and either stashes 
them into transaction (just like other changes) or applies them right 
away. I'd say that's the most complicated part - deciding whether the 
change is transactional or not.

0002 reworks that so that it doesn't decode the existing WAL records, 
but tracks sequences which have been modified (updated on-disk state) 
and then accessed in the current transaction. And then at COMMIT time we 
write a new WAL message with info about the sequence.

I realized we already cache sequences for each session - seqhashtab in 
sequence.c. It doesn't have any concept of a transaction, but it seems 
fairly easy to make that possible. I did this by adding two flags

  - needs_log - means the seesion advanced the sequence (on disk)
  - touched   - true if the current xact called nextval() etc.

The idea is that what matters is updates to on-disk state, so whenever 
we do that we set needs_log. But it only matters when the changes are 
made visible in a committed transaction. Consider for example this:

BEGIN;
SELECT nextval('s') FROM generate_series(1,10000) s(i);
ROLLBACK;
SELECT nextval('s');

The first nextval() call certainly sets both flags to true, at least for 
default sequences caching 32 values. But the values are not confirmed to 
the user because of the rollback - this resets 'touched' flag, but 
leaves 'needs_log' set to true.

And then the next nextval() - which may easily be just from cache - sets 
touched=true again, and logs the sequence state at (implicit) commit. 
Which resets both flags again.

The logging/cleanup happens in AtEOXact_Sequences() which gets called 
before commit/abort. This walks all cached sequences and writes the 
state for those with both flags true (or resets flag for abort).

The cache also keeps info about the last "sequence state" in the 
session, which is then used when writing into into WAL.


To write the sequence state into WAL, I've added a new WAL record 
xl_logical_sequence to RM_LOGICALMSG_ID, next to the xl_logical_message. 
It's a bit arbitrary, maybe it should be part of RM_SEQ_ID, but it does 
the trick. I don't think this is the main issue and it's easy enough to 
move it elsewhere if needed.

So, that seems fairly straight-forward and it may reduce the number of 
replication messages for large transactions. Unfortunately, it's not 
much simpler compared to the first approach - the amount of code is 
about the same, and there's a bunch of other issues.

The main issue seems to be about ordering. Consider multiple sessions 
all advancing the sequence. With the "old" approach this was naturally 
ordered - the order in which the increments were written to WAL made 
sense. But the sessions may advance the sequences in one order and then 
commit in a different order, which mixes the updates. Consider for 
example this scenario with two concurrent transactions:

T1: nextval('s') -> allocates values [1,32]
T2: nextval('s') -> allocates values [33,64]
T2: commit -> logs [33,64]
T1: commit -> logs [1,32]

The result is the sequence on the replica diverted because it replayed 
the increments in the opposite order.

I can think of two ways to fix this. Firstly, we could "merge" the 
increments in some smart way, e.g. by discarding values considered 
"stale" (like decrements). But that seems pretty fragile, because the 
sequence may be altered in various ways, reset, etc. And it seems more 
like transferring responsibility to someone else instead of actually 
solving the issue.

The other fix is simply reading the current sequence state from disk at 
commit and logging that (instead of the values cached from the last 
increment). But I'm rather skeptical about doing such things right 
before COMMIT.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Why not try for a HOT update, even when PageIsFull()?
Next
From: Alvaro Herrera
Date:
Subject: Re: xlog.c: removing ReadRecPtr and EndRecPtr