Re: table_tuple_lock's snapshot argument - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: table_tuple_lock's snapshot argument |
Date | |
Msg-id | 190b4c9f-8173-4f0d-9877-a7aebb7c41c9@iki.fi Whole thread Raw |
In response to | table_tuple_lock's snapshot argument (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: table_tuple_lock's snapshot argument
|
List | pgsql-hackers |
On 21/01/2025 12:05, Alexander Korotkov wrote: > On Tue, Jan 21, 2025 at 12:03 PM Alexander Korotkov > <aekorotkov@gmail.com> wrote: >> On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> However, I think GetLatestSnapshot() is wrong here anyway. None of this >>> matters for heapam which just ignores the 'snapshot' argument, but let's >>> assume a different AM that would use the snapshot to identify the tuple >>> version. The TID was fetched from an index using an index scan with >>> SnapshotDirty. There's no guarantee that the LatestSnapshot would match >>> the same tuple version that the index scan found. If an MVCC snapshot is >>> needed, surely it should be acquired before the index scan, and used for >>> the index scan as well. >>> >>> I see three options: >>> >>> 1. Remove the 'snapshot' argument, since it's not used by heapam. If we >>> get a table AM where a single TID represents multiple row versions, this >>> will need to be revisited. >>> >>> 2. Rewrite the recheck code in execReplication.c so that it uses the >>> snapshot in a more consistent fashion. Call GetLatestSnapshot() first, >>> and use the same snapshot in the index scan and table_tuple_lock(). >>> Acquiring a snapshot isn't free though, so it would be nice to avoid >>> doing that when the heapam is just going to ignore it anyway. If we go >>> with this option, I think we could reuse the snapshot that is already >>> active in most cases, and only take a new snapshot if the tuple was >>> concurrently updated. >>> >>> 3. Modify the tableam interface so that the index scan can return a more >>> unique identifier of the tuple version. In heapam, it could be the TID >>> like today, but a different AM could return some other token. Continue >>> to use SnapshotDirty in the index scan, but in the call to >>> table_tuple_lock(), instead of passing GetLatestSnapshot() and TID, pass >>> the token you got index_getnext_slot(). >> >> Thank you for your finding and analysis. I would vote for #3. >> Currently snapshot argument is not properly used, and the heapam code >> doesn't check it. I think allowing flexible tuple identifier is >> essential for future of table AM API. Therefore, any >> snapshot/visibility information could be put inside that tuple >> identifier if needed. > > To be more specific, I vote for #1 + #3. Remove the snapshot argument > for now, then work on flexible tuple identifier. I committed and backpatched the trivial fix for not, just replacing GetLatestSnapshot() with GetActiveSnapshot(). I got cold feet on just removing the argument without providing a replacement, because it's not 100% clear to me if the current situation is broken or not. To summarize the issue, RelationFindReplTupleByIndex() performs an index scan with DirtySnapshot, and then calls table_tuple_lock() with a fresh MVCC snapshot to lock the row. There's no guarantee that the index scan found the same row version that table_tuple_lock() will lock, if the TID alone doesn't uniquely identify it. But that's still OK as long as the key column hasn't changed, like with heapam's HOT updates. I couldn't convince myself that it's broken nor that it's guaranteed to be correct with other AMs. Thanks! -- Heikki Linnakangas Neon (https://neon.tech)
pgsql-hackers by date: