Re: logical decoding and replication of sequences, take 2 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: logical decoding and replication of sequences, take 2
Date
Msg-id 195be236-7580-72c6-97c7-59f06c4e093c@enterprisedb.com
Whole thread Raw
In response to Re: logical decoding and replication of sequences, take 2  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: logical decoding and replication of sequences, take 2
List pgsql-hackers
On 11/28/23 12:32, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I spent a bit of time looking at the proposed change, and unfortunately
>> logging just the boolean flag does not work. A good example is this bit
>> from a TAP test added by the patch for built-in replication (which was
>> not included with the WIP patch):
>>
>>   BEGIN;
>>   ALTER SEQUENCE s RESTART WITH 1000;
>>   SAVEPOINT sp1;
>>   INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
>>   ROLLBACK TO sp1;
>>   COMMIT;
>>
>> This is expected to produce:
>>
>>   1131|0|t
>>
>> but produces
>>
>>   1000|0|f
>>
>> instead. The reason is very simple - as implemented, the patch simply
>> checks if the relfilenode is from the same top-level transaction, which
>> it is, and sets the flag to "true". So we know the sequence changes need
>> to be queued and replayed as part of this transaction.
>>
>> But then during decoding, we still queue the changes into the subxact,
>> which then aborts, and the changes are discarded. That is not how it's
>> supposed to work, because the new relfilenode is still valid, someone
>> might do nextval() and commit. And the nextval() may not get WAL-logged,
>> so we'd lose this.
>>
>> What I guess we might do is log not just a boolean flag, but the XID of
>> the subtransaction that created the relfilenode. And then during
>> decoding we'd queue the changes into this subtransaction ...
>>
>> 0006 in the attached patch series does this, and it seems to fix the TAP
>> test failure. I left it at the end, to make it easier to run tests
>> without the patch applied.
>>
> 
> Offhand, I don't have any better idea than what you have suggested for
> the problem but this needs some thoughts including the questions asked
> by you. I'll spend some time on it and respond back.
> 

I've been experimenting with the idea to log the XID, and for a moment I
was worried it actually can't work, because subtransactions may not
actually be just nested in simple way, but form a tree. And what if the
sequence was altered in a different branch (sibling subxact), not in the
immediate parent. In which case the new SubTransactionGetXid() would
fail, because it just walks the current chain of subtransactions.

I've been thinking about cases like this:

   BEGIN;
   CREATE SEQUENCE s;        # XID 1000
   SELECT alter_sequence();  # XID 1001
   SAVEPOINT s1;
   SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000
   ROLLBACK TO s1;
   SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000
   COMMIT;

The XID values are what the sequence wal record will reference, assuming
that the main transaction XID is 1000.

Initially, I thought it's wrong that the nextval() calls reference XID
of the main transaction, because the last relfilenode comes from 1001,
which is the subxact created by alter_sequence() thanks to the exception
handling block. And that's where the approach in reorderbuffer would
queue the changes.

But I think this is actually correct too. When a subtransaction commits
(e.g. when alter_sequence() completes), it essentially becomes part of
the parent. And AtEOSubXact_cleanup() updates rd_newRelfilelocatorSubid
accordingly, setting it to parentSubid.

This also means that SubTransactionGetXid() can't actually fail, because
the ID has to reference an active subtransaction in the current stack.
I'm still concerned about the cost of the lookup, because the list may
be long and the subxact we're looking for may be quite high, but I guess
we might have another field, caching the XID. It'd need to be updated
only in AtEOSubXact_cleanup, and at that point we know it's the
immediate parent, so it'd be pretty cheap I think.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Partial aggregates pushdown
Next
From: Stephen Frost
Date:
Subject: Re: [HACKERS] Changing references of password encryption to hashing