Re: logical decoding and replication of sequences - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: logical decoding and replication of sequences
Date
Msg-id 2cd38bab-c874-8e0b-98e7-d9abaaf9806a@enterprisedb.com
Whole thread Raw
In response to Re: logical decoding and replication of sequences  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: logical decoding and replication of sequences
Re: logical decoding and replication of sequences
List pgsql-hackers
Hi,

Here's a slightly improved version of the patch, fixing a couple issues 
I failed to notice. It also addresses a couple of the issues described 
in the last message, although mostly to show what would need to be done.

1) Handle sequences dropped in the xact by calling try_relation_open, 
and just doing nothing if not found. Otherwise we'd get a failure in 
reorderbuffer, when decoding the change.

2) Fixed nextval_internal() to log the correct last_value (the one we 
write into WAL).

3) Reread the sequence state in AtEOXact_Sequences, to deal with the 
ordering issue described before. This makes (2) somewhat pointless, 
because we just read whatever is on disk at that point. But having both 
makes it easier to experiment / see what'd happen.

4) Log the stats in DefineSequence() - Without this we'd not have the 
initial sequence state in the WAL, because only nextval/setval etc. do 
the logging. The old approach (decoding the sequence tuple) does not 
have this issue.


The (3) changes the behavior in a somewhat strange way. Consider this 
case with two concurrent transactions:

T1: BEGIN;
T2: BEGIN;
T1: SELECT nextval('s') FROM generate_series(1,100) s(i);
T2: SELECT nextval('s') FROM generate_series(1,100) s(i);
T1: COMMIT;
T2: COMMIT;

The result is that both transactions have used the same sequence, and so 
will re-read the state from disk. But at that point the state is exactly 
the same, so we'll log the same thing twice.

There's a much deeper issue, though. The current patch only logs the 
sequence if the session generated WAL when incrementing the sequence 
(which happens every 32 values). But other sessions may already use 
values from this range, so consider for example this:

T1: BEGIN;
T1: SELECT nextval('s') FROM generate_series(1,100) s(i);
T2: BEGIN;
T2: SELECT nextval('s');
T2: COMMIT;
T1: ROLLBACK;

Which unfortunately means T2 already used a value, but the increment may 
not be logged at that time (or ever). This seems like a fatal issue, 
because it means we need to log *all* sequences the transaction touches, 
not just those that wrote the increment to WAL. That might still work 
for large transactions consuming many sequence values, but it's pretty 
inefficient for small OLTP transactions that only need one or two values 
from the sequence.

So I think just decoding the sequence tuples is a better solution - for 
large transactions (consuming many values from the sequence) it may be 
more expensive (i.e. send more records to replica). But I doubt that 
matters too much - it's likely negligible compared to other data for 
large transactions.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Why not try for a HOT update, even when PageIsFull()?
Next
From: "Shinoda, Noriyoshi (PN Japan FSIP)"
Date:
Subject: RE: Improve logging when using Huge Pages