Re: index prefetching - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Re: index prefetching
Msg-id | CAH2-WznsJqDgr_0yUwApgYXi3cRZQbimFkiYRqqXhpMcw4s8ZQ@mail.gmail.com
In response to | Re: index prefetching (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses | Re: index prefetching; Re: index prefetching
List | pgsql-hackers
On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 2/7/24 22:48, Melanie Plageman wrote:
> I admit I haven't thought about kill_prior_tuple until you pointed out.
> Yeah, prefetching separates (de-synchronizes) the two scans (index and
> heap) in a way that prevents this optimization. Or at least makes it
> much more complex :-(

Another thing that argues against doing this is that we might not need
to visit any more B-Tree leaf pages when there is a LIMIT n involved. We
could end up scanning a whole extra leaf page (including all of its
tuples) for want of the ability to "push down" a LIMIT to the index AM
(that's not what happens right now, but only because it isn't really
needed yet).

This property is fundamental to how index scans work. Pinning an index
page as an interlock against concurrent TID recycling by VACUUM is
directly described by the index API docs [1], even (the docs actually
use terms like "buffer pin" rather than something more abstract
sounding). I don't think that anything affecting that behavior should
be considered an implementation detail of the nbtree index AM as such
(nor of any particular index AM).

I think that it makes sense to put the index AM in control here -- that
almost follows from what I said about the index AM API. The index AM
already needs to be in control, in about the same way, to deal with
kill_prior_tuple (plus it helps with the LIMIT issue I described).

There doesn't necessarily need to be much code duplication to make that
work. Offhand I suspect it would be kind of similar to how deletion of
LP_DEAD-marked index tuples by non-nbtree index AMs gets by with
generic logic implemented by index_compute_xid_horizon_for_tuples() --
that's all that we need to determine a snapshotConflictHorizon value
for recovery conflict purposes. Note that
index_compute_xid_horizon_for_tuples() reads *index* pages, despite not
being aware of the caller's index AM and index tuple format.

(The only reason why nbtree needs a custom solution is because it has
posting list tuples to worry about, unlike GiST and unlike Hash, which
consistently use unadorned generic IndexTuple structs with heap TID
represented in the standard/generic way only. While these concepts
probably all originated in nbtree, they're still not nbtree
implementation details.)

> > Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
> > there is an easier way to fix this, as I don't think the mvcc test
> > failed on Tomas' version.
>
> I kinda doubt it worked correctly, considering I simply ignored the
> optimization. It's far more likely it just worked by luck.

The test that did fail will only have revealed that the
kill_prior_tuple optimization wasn't operating as expected -- which
isn't the same thing as giving wrong answers. Note that there are
various ways that concurrent TID recycling might prevent
_bt_killitems() from setting LP_DEAD bits. It's totally unsurprising
that breaking kill_prior_tuple in some way could be missed. Andres
wrote the MVCC test in question precisely because certain aspects of
kill_prior_tuple were broken for months without anybody noticing.

[1] https://www.postgresql.org/docs/devel/index-locking.html

--
Peter Geoghegan
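The pin interlock described in the message can be pictured with a small
standalone model. This is not PostgreSQL code: the names below
(leaf_page, scan_leaf, vacuum_leaf) are invented for illustration, and
real VACUUM takes a cleanup lock on the buffer rather than using a bare
mutex and condition variable. The only point is the ordering guarantee:
TIDs read from a pinned leaf page cannot be recycled until the pin is
dropped, which is also what lets something like _bt_killitems() safely
set LP_DEAD bits after the heap visits complete.

```c
/* Toy model of the buffer-pin interlock; not PostgreSQL code. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define ITEMS 4

typedef struct leaf_page
{
    pthread_mutex_t lock;
    pthread_cond_t  unpinned;
    int             pin_count;      /* "buffer pins" held by scans */
    bool            lp_dead[ITEMS]; /* kill bits set by scans */
    bool            recycled[ITEMS];
} leaf_page;

static leaf_page page = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0,
    {false}, {false}
};

/* Index scan: pin the leaf, visit the heap, then set kill bits. */
static void *
scan_leaf(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&page.lock);
    page.pin_count++;               /* acquire the "pin" */
    pthread_mutex_unlock(&page.lock);

    /* Heap accesses happen here.  The TIDs we copied from the page stay
     * valid, because vacuum_leaf() below waits for pin_count to drop. */
    usleep(100 * 1000);

    pthread_mutex_lock(&page.lock);
    page.lp_dead[1] = true;         /* analog of kill_prior_tuple */
    page.pin_count--;               /* release the "pin" */
    pthread_cond_signal(&page.unpinned);
    pthread_mutex_unlock(&page.lock);
    return NULL;
}

/* VACUUM: wait out all pins before recycling LP_DEAD items. */
static void *
vacuum_leaf(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&page.lock);
    while (page.pin_count > 0)
        pthread_cond_wait(&page.unpinned, &page.lock);
    for (int i = 0; i < ITEMS; i++)
        if (page.lp_dead[i])
            page.recycled[i] = true; /* safe: no scan holds a stale TID */
    pthread_mutex_unlock(&page.lock);
    return NULL;
}

int
main(void)
{
    pthread_t s, v;

    pthread_create(&s, NULL, scan_leaf, NULL);
    usleep(10 * 1000);              /* let the scan take its pin first */
    pthread_create(&v, NULL, vacuum_leaf, NULL);
    pthread_join(s, NULL);
    pthread_join(v, NULL);
    printf("item 1 recycled: %d\n", (int) page.recycled[1]);
    return 0;
}
```

Prefetching breaks this picture exactly because the heap visits are no
longer bracketed by the pin on the leaf page they came from, which is
why kill_prior_tuple gets harder once the two scans are de-synchronized.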
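The generic logic mentioned for index_compute_xid_horizon_for_tuples()
reduces, conceptually, to taking the most advanced XID among the heap
tuples that the to-be-deleted index tuples point to. Below is a minimal
sketch of that idea only; compute_xid_horizon and its types are
stand-ins, and real code compares XIDs with wraparound-aware logic
rather than a plain greater-than.

```c
/* Conceptual sketch; types and names are stand-ins, not PostgreSQL's. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Given the XIDs of the dead heap tuples behind the index tuples being
 * deleted, the recovery conflict horizon is the latest such XID. */
static TransactionId
compute_xid_horizon(const TransactionId *dead_tuple_xids, int ntuples)
{
    TransactionId horizon = 0;      /* stand-in for InvalidTransactionId */

    for (int i = 0; i < ntuples; i++)
        if (dead_tuple_xids[i] > horizon)   /* real code: wraparound-aware */
            horizon = dead_tuple_xids[i];
    return horizon;
}

int
main(void)
{
    TransactionId xids[] = {912, 907, 955};

    printf("snapshotConflictHorizon = %u\n",
           (unsigned) compute_xid_horizon(xids, 3));
    return 0;
}
```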