Hi,
Here's a slightly improved version of the patch, fixing a couple issues
I failed to notice. It also addresses a couple of the issues described
in the last message, although mostly to show what would need to be done.
1) Handle sequences dropped in the xact by calling try_relation_open,
and just doing nothing if not found. Otherwise we'd get a failure in
reorderbuffer, when decoding the change.
2) Fixed nextval_internal() to log the correct last_value (the one we
write into WAL).
3) Reread the sequence state in AtEOXact_Sequences, to deal with the
ordering issue described before. This makes (2) somewhat pointless,
because we just read whatever is on disk at that point. But having both
makes it easier to experiment / see what'd happen.
4) Log the stats in DefineSequence() - Without this we'd not have the
initial sequence state in the WAL, because only nextval/setval etc. do
the logging. The old approach (decoding the sequence tuple) does not
have this issue.
The (3) changes the behavior in a somewhat strange way. Consider this
case with two concurrent transactions:
T1: BEGIN;
T2: BEGIN;
T1: SELECT nextval('s') FROM generate_series(1,100) s(i);
T2: SELECT nextval('s') FROM generate_series(1,100) s(i);
T1: COMMIT;
T2: COMMIT;
The result is that both transactions have used the same sequence, and so
will re-read the state from disk. But at that point the state is exactly
the same, so we'll log the same thing twice.
There's a much deeper issue, though. The current patch only logs the
sequence if the session generated WAL when incrementing the sequence
(which happens every 32 values). But other sessions may already use
values from this range, so consider for example this:
T1: BEGIN;
T1: SELECT nextval('s') FROM generate_series(1,100) s(i);
T2: BEGIN;
T2: SELECT nextval('s');
T2: COMMIT;
T1: ROLLBACK;
Which unfortunately means T2 already used a value, but the increment may
not be logged at that time (or ever). This seems like a fatal issue,
because it means we need to log *all* sequences the transaction touches,
not just those that wrote the increment to WAL. That might still work
for large transactions consuming many sequence values, but it's pretty
inefficient for small OLTP transactions that only need one or two values
from the sequence.
So I think just decoding the sequence tuples is a better solution - for
large transactions (consuming many values from the sequence) it may be
more expensive (i.e. send more records to replica). But I doubt that
matters too much - it's likely negligible compared to other data for
large transactions.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company