Re: logical decoding and replication of sequences - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: logical decoding and replication of sequences
Date
Msg-id 3d6df331-5532-6848-eb45-344b265e0238@enterprisedb.com
Whole thread Raw
In response to Re: logical decoding and replication of sequences  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: logical decoding and replication of sequences
List pgsql-hackers
Hi,

Here's a an updated version of this patch - rebased to current master, 
and fixing some of the issues raised in Peter's review.

Mainly, this adds a TAP test to src/test/subscription, focusing on 
testing the various situations with transactional and non-transactional 
behavior (with subtransactions and various ROLLBACK versions).

This new TAP test however uncovered an issue with wait_for_catchup(), 
because that uses pg_current_wal_lsn() to wait for replication of all 
the changes. But that does not work when the sequence gets advanced in a 
transaction that is then aborted, e.g. like this:

BEGIN;
SELECT nextval('s') FROM generate_series(1,100);
ROLLBACK;

The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write, 
which is updated by XLogFlush() - but only in RecordTransactionCommit. 
Which makes sense, because only the committed stuff is "visible". But 
the non-transactional behavior changes this, because now some of the 
changes from aborted transactions may need to be replicated. Which means 
the wait_for_catchup() ends up not waiting for the sequence change.

One option would be adding XLogFlush() to RecordTransactionAbort(), but 
my guess is we don't do that intentionally (even though aborts should be 
fairly rare in most workloads).

I'm not entirely sure changing this (replicating changes from aborted 
xacts) is a good idea. Maybe it'd be better to replicate this "lazily" - 
instead of replicating the advances right away, we might remember which 
sequences were advanced in the transaction, and then replicate current 
state for those sequences at commit time.

The idea is that if an increment is "invisible" we probably don't need 
to replicate it, it's fine to replicate the next "visible" increment. So 
for example given

BEGIN;
SELECT nextval('s');
ROLLBACK;

BEGIN;
SELECT nextval('s');
COMMIT;

we don't need to replicate the first change, but we need to replicate 
the second one.

The trouble is we don't decode individual sequence advances, just those 
that update the sequence tuple (so every 32 values or so). So we'd need 
to remeber the first increment, in a way that is (a) persistent across 
restarts and (b) shared by all backends.

The other challenge seems to be ordering of the changes - at the moment 
we have no issues with this, because increments on the same sequence are 
replicated immediately, in the WAL order. But postponing them to commit 
time would affect this order.


I've also briefly looked at the code duplication - there's a couple 
functions in the patch that I shamelessly copy-pasted and tweaked to 
handle sequences instead of tables:

publicationcmds.c
-----------------

1) OpenTableList/CloseTableList - > OpenSequenceList/CloseSequenceList

Trivial differences, trivial to get rid of - the only difference is 
pretty much just table_open vs. relation open.


2) AlterPublicationTables -> AlterPublicationSequences

This is a bit more complicated, because the tables also handle 
inheritance (which I think does not apply to sequences). Other than 
that, it's calling the functions from (1).


subscriptioncmds.c
------------------

1) fetch_table_list, fetch_sequence_list

Minimal differences, essentially just the catalog name.

2) AlterSubscription_refresh

A lot of duplication, but the code handling tables and sequences is 
almost exactly the same and can be reused fairly easily (moved to a 
separate function, called for tables and then sequences).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [PATCH] Hooks at XactCommand level
Next
From: Alvaro Herrera
Date:
Subject: Re: archive status ".ready" files may be created too early