Re: table_tuple_lock's snapshot argument - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: table_tuple_lock's snapshot argument
Date
Msg-id 190b4c9f-8173-4f0d-9877-a7aebb7c41c9@iki.fi
Whole thread Raw
In response to table_tuple_lock's snapshot argument  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: table_tuple_lock's snapshot argument
List pgsql-hackers
On 21/01/2025 12:05, Alexander Korotkov wrote:
> On Tue, Jan 21, 2025 at 12:03 PM Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
>> On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> However, I think GetLatestSnapshot() is wrong here anyway. None of this
>>> matters for heapam which just ignores the 'snapshot' argument, but let's
>>> assume a different AM that would use the snapshot to identify the tuple
>>> version. The TID was fetched from an index using an index scan with
>>> SnapshotDirty. There's no guarantee that the LatestSnapshot would match
>>> the same tuple version that the index scan found. If an MVCC snapshot is
>>> needed, surely it should be acquired before the index scan, and used for
>>> the index scan as well.
>>>
>>> I see three options:
>>>
>>> 1. Remove the 'snapshot' argument, since it's not used by heapam. If we
>>> get a table AM where a single TID represents multiple row versions, this
>>> will need to be revisited.
>>>
>>> 2. Rewrite the recheck code in execReplication.c so that it uses the
>>> snapshot in a more consistent fashion. Call GetLatestSnapshot() first,
>>> and use the same snapshot in the index scan and table_tuple_lock().
>>> Acquiring a snapshot isn't free though, so it would be nice to avoid
>>> doing that when the heapam is just going to ignore it anyway. If we go
>>> with this option, I think we could reuse the snapshot that is already
>>> active in most cases, and only take a new snapshot if the tuple was
>>> concurrently updated.
>>>
>>> 3. Modify the tableam interface so that the index scan can return a more
>>> unique identifier of the tuple version. In heapam, it could be the TID
>>> like today, but a different AM could return some other token. Continue
>>> to use SnapshotDirty in the index scan, but in the call to
>>> table_tuple_lock(), instead of passing GetLatestSnapshot() and TID, pass
>>> the token you got index_getnext_slot().
>>
>> Thank you for your finding and analysis.  I would vote for #3.
>> Currently snapshot argument is not properly used, and the heapam code
>> doesn't check it.  I think allowing flexible tuple identifier is
>> essential for future of table AM API.  Therefore, any
>> snapshot/visibility information could be put inside that tuple
>> identifier if needed.
> 
> To be more specific, I vote for #1 + #3. Remove the snapshot argument
> for now, then work on flexible tuple identifier.

I committed and backpatched the trivial fix for not, just replacing 
GetLatestSnapshot() with GetActiveSnapshot(). I got cold feet on just 
removing the argument without providing a replacement, because it's not 
100% clear to me if the current situation is broken or not.

To summarize the issue, RelationFindReplTupleByIndex() performs an index 
scan with DirtySnapshot, and then calls table_tuple_lock() with a fresh 
MVCC snapshot to lock the row. There's no guarantee that the index scan 
found the same row version that table_tuple_lock() will lock, if the TID 
alone doesn't uniquely identify it. But that's still OK as long as the 
key column hasn't changed, like with heapam's HOT updates. I couldn't 
convince myself that it's broken nor that it's guaranteed to be correct 
with other AMs.

Thanks!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Changing the state of data checksums in a running cluster
Next
From: Naga Appani
Date:
Subject: [Proposal] Expose internal MultiXact member count function for efficient monitoring