On 12/18/21 07:00, Tomas Vondra wrote:
>
>
> On 12/18/21 05:52, Tom Lane wrote:
>> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>>> The problem is exactly the same as in [1] - the aborted transaction
>>> generated WAL, but RecordTransactionAbort() ignores that and does not
>>> update LogwrtResult.Write, with the reasoning that aborted transactions
>>> do not matter. But sequences violate that, because we only write WAL
>>> once every 32 increments, so the following nextval() gets "committed"
>>> without waiting for the replica (because it did not produce WAL).
>>
>> Ugh.
>>
>>> I'm not sure this is a clear data corruption bug, but it surely walks
>>> and quacks like one. My proposal is to fix this by tracking the lsn of
>>> the last LSN for a sequence increment, and then check that LSN in
>>> RecordTransactionCommit() before calling XLogFlush().
>>
>> (1) Does that work if the aborted increment was in a different
>> session? I think it is okay but I'm tired enough to not be sure.
>>
>
> Good point - it doesn't :-( At least not by simply storing LSN in a
> global variable or something like that.
>
> The second backend needs to know the LSN of the last WAL-logged sequence
> increment, but only the first backend knows that. So we'd need to share
> that between backends somehow. I doubt we want to track LSN for every
> individual sequence (because for clusters with many dbs / sequences that
> may be a lot).
>
> Perhaps we could track just a fixed number o LSN values in shared memory
> (say, 1024), and update/read just the element determined by hash(oid).
> That is, the backend WAL-logging sequence with given oid would set the
> current LSN to array[hash(oid) % 1024], and backend doing nextval()
> would simply remember the LSN in that slot. Yes, if there are conflicts
> that'll flush more than needed.
>
Here's a PoC demonstrating this idea. I'm not convinced it's the right
way to deal with this - it surely seems more like a duct tape fix than a
clean solution. But it does the trick.
I wonder if storing this in shmem is good enough - we lose the LSN info
on restart, but the checkpoint should trigger FPI which makes it OK.
A bigger question is whether sequences are the only thing affected by
this. If you look at RecordTransactionCommit() then we skip flush/wait
in two cases:
1) !wrote_xlog - if the xact did not produce WAL
2) !markXidCommitted - if the xact does not have a valid XID
Both apply to sequences, and the PoC patch tweaks them. But maybe there
are other places where we don't generate WAL and/or assign XID in some
cases, to save time?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company