Hello, everyone and Peter!
Peter, I have added you because you may be interested in (or already know about) this btree-related issue.
Short description of the problem:
I noticed a concurrency issue in btree index scans that affects SnapshotDirty and SnapshotSelf scan types.
When using these non-MVCC snapshot types, a scan could miss tuples if concurrent transactions delete existing tuples and insert new one with different TIDs on the same page.
The problem occurs because:
1. The scan reads a page and caches its tuples in backend-local storage
2. A concurrent transaction deletes a tuple and inserts a new one with a different TID
3. The scan misses the new tuple because it was already deleted by a committed transaction and does not pass visibility check
4. But new version on the page is missed, because not in cached tuples
This may cause issues with:
- logical replication (RelationFindReplTupleByIndex fail) - invalid conflict message (MISSING instead of ORIGIN_DIFFERS), probably other issues with upcoming conflict resolution for logical replication
- check_exclusion_or_unique_constraint false negative (but currently it does not cause any real issues as far as I can see)
The fix implemented in this version of the patch:
- Retains the read lock on a page for SnapshotDirty and SnapshotSelf scans until we're completely done with all tuples from that page
- Introduces a new 'extra_unlock' field in BTScanPos to track when a lock is being held longer than usual
- Updates documentation to explain this special locking behavior
Yes, it may cause some degradation in performance because of that additional lock.
Another possible idea is to use a fresh MVCC snapshot for such cases (but I think it is still better to fix or at least document that issue anyway).
Best regards,
Mikhail.