Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Date:
Msg-id: CAH2-Wz=zOxjVdz0ZDch=Hyh07SWkorRvD+HGUcT5xMj-sSFt5w@mail.gmail.com
In response to: Re: Deleting older versions in unique indexes to avoid page splits (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Deleting older versions in unique indexes to avoid page splits
List: pgsql-hackers
On Wed, Oct 21, 2020 at 8:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> That certainly isn't great. I mean, it might not be too terrible,
> because a leaf index page isn't nearly as potentially hot as a VM
> page or a clog page, but it hurts interruptibility and risks hurting
> concurrency, but if it were possible to arrange to hold only a pin on
> the page during all this rather than a lock, it would be better. I'm
> not sure how realistic that is, though.

I don't think that it's realistic. Well, technically you could do something like that, but you'd end up with some logically equivalent mechanism, which would probably be slower. As you know, in nbtree pins are generally much less helpful than within heapam (you cannot read a page without a shared buffer lock, no matter what). Holding a pin only provides a very weak guarantee about VACUUM and TID recycling that usually doesn't come up.

Bear in mind that we already do practically the same thing all the time with the current LP_DEAD setting stuff, where we need to call compute_xid_horizon_for_tuples/heap_compute_xid_horizon_for_tuples with a leaf buffer lock held in almost the same way. That's actually potentially far worse if you look at it in isolation, because you could have hundreds of heap pages, whereas this is just 1-3. (BTW, the next version will also do that work in passing, so you're practically guaranteed to do less with a buffer lock held compared to the typical case of nbtree LP_DEAD setting, even without counting how the LP_DEAD bits get set in the first place.) I could also point out that something very similar happens in _bt_check_unique().

Also bear in mind that the alternative is pretty much a page split, which means:

* Locking the leaf page
* Then obtaining the relation extension lock
* Locking to create the new right sibling
* Releasing the relation extension lock
* Locking the original right sibling page
* Releasing the original right sibling page
* Releasing the new right sibling page
* Locking the parent page
* Releasing the original now-split page
* Releasing the parent page

(I will refrain from going into all of the absurd and near-permanent secondary costs that just giving up and splitting the page imposes for now. I didn't even include all of the information about locking -- there is one thing that didn't seem worth mentioning.)

The key concept here is of course asymmetry. The asymmetry here is not only favorable; it's just outrageous. The other key concept is that it's fundamentally impossible to pay more than a very small fixed cost without getting a benefit.

That said, I accept that there is still some uncertainty about whether all workloads that get a benefit will be happy with the trade-off. I am still fine-tuning how this works in cases with high contention. I welcome any help with that part.

But note that this doesn't necessarily have much to do with the heap page accesses. It's not always strictly better to never have any bloat at all (it's pretty close to that, but not quite). We saw this with the Postgres 12 work, where small TPC-C test cases had some queries go slower simply because a small, highly contended index did not get bloated, thanks to a smarter split algorithm. There is no reason to believe that it had anything to do with the cost of making better decisions. It was the decisions themselves.
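Coming back to the earlier point about what already happens with a leaf buffer lock held: the following is a rough sketch of the existing simple deletion path, for illustration only. It is a simplified paraphrase rather than the actual nbtree code -- the function name is invented, and while the backend calls shown (index_compute_xid_horizon_for_tuples(), PageIndexMultiDelete(), and so on) exist in the Postgres 12/13 sources, treat the exact signatures and surrounding details as approximate.

#include "postgres.h"
#include "access/genam.h"
#include "access/xlog.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Rough sketch, not verbatim PostgreSQL code: a simplified paraphrase of the
 * existing nbtree simple deletion path.  The caller found LP_DEAD-marked
 * items on a leaf page and still holds an exclusive lock on its buffer.
 */
static void
leaf_simple_deletion_sketch(Relation rel, Relation heapRel, Buffer leafbuf,
                            OffsetNumber *deletable, int ndeletable)
{
    Page        page = BufferGetPage(leafbuf);
    TransactionId latestRemovedXid = InvalidTransactionId;

    /*
     * The point at issue: the leaf buffer's exclusive lock is held across
     * this call, which may visit heap pages (potentially many of them) to
     * establish the conflict horizon needed for hot standby.
     */
    if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
        latestRemovedXid =
            index_compute_xid_horizon_for_tuples(rel, heapRel, leafbuf,
                                                 deletable, ndeletable);

    /* Physically remove the dead items -- no page split is needed. */
    START_CRIT_SECTION();
    PageIndexMultiDelete(page, deletable, ndeletable);
    MarkBufferDirty(leafbuf);
    /* A real deletion would emit a WAL record carrying latestRemovedXid. */
    END_CRIT_SECTION();
}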
I don't want to completely prevent "version driven page splits" (though a person could reasonably imagine that that is in fact my precise goal); rather, I want to make non-hot updaters work to prove that it's almost certainly necessary to split the page due to version churn -- then and only then should it be accepted. Currently we meekly roll over and let non-hot updaters impose negative externalities on the system as a whole. The patch usually clearly benefits even workloads that consist entirely of non-hot updaters. Negative externalities are only good for the individual trying to impose costs on the collective when they can be a true freeloader. It's always bad for the collective, but it's even bad for the bad actors once they're more than a small minority.

Currently, non-hot updaters are not merely selfish to the extent that they impose a downside on the collective or the system as a whole that is roughly proportionate to the upside benefit they get. Not cleaning up their mess as they go creates a downside that is a huge multiple of any possible upside for them. To me this seems incontrovertible. Worrying about the precise extent to which this is true in each situation doesn't seem particularly productive to me. Whatever the actual extent of the imbalance is, the solution is that you don't let them do that.

This patch is not really about overall throughput. It could be justified on that basis, but that's not how I like to think of it. Rather, it's about providing a stabilizing backstop mechanism, which tends to bound the amount of index bloat and the number of versions in each index for each *logical row* -- that's the most important benefit of the patch. There are workloads that will greatly benefit despite only invoking the new mechanism very occasionally, as a backstop. And even cases with a fair amount of contention don't really use it that often (which is why the heap page access cost is pretty much a question about specific high-contention patterns only).

The proposed new cleanup mechanism may only be used in certain parts of the key space for certain indexes at certain times, in a bottom-up fashion. We don't have to be eager about cleaning up bloat most of the time, but it's also true that there are cases where we ought to work very hard at it in a localized way. This explanation may sound unlikely, but the existing behaviors taken together present us with outrageous cost/benefit asymmetry, arguably in multiple dimensions.

I think that having this backstop cleanup mechanism (and likely others in other areas) will help to make the assumptions underlying autovacuum scheduling much more reasonable in realistic settings. Now it really is okay that autovacuum doesn't really care about the needs of queries, and is largely concerned with macro-level things like free space management. Its top-down approach isn't so bad once it has true bottom-up complementary mechanisms.

The LP_DEAD microvacuum stuff is nice because it marks things as dead in passing, pretty much for free. That's not enough on its own -- it's no backstop. The current LP_DEAD stuff appears to work rather well, until one day it suddenly doesn't, and you curse Postgres for it. I could go on about the non-linear nature of the system as a whole, hidden tipping points, and other stuff like that. But I won't right now.

--
Peter Geoghegan