Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Msg-id: CAH2-WzkJ1sO6inwECttDDGdts9YSsQQM_8h3_+8DO1Bmykj_ug@mail.gmail.com
In response to: Re: Deleting older versions in unique indexes to avoid page splits (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
On Tue, Dec 1, 2020 at 1:50 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This is a wholly new concept with a lot of heuristics. It's a lot to
> swallow.

Thanks for taking a look! Yes, it's a little unorthodox.

Ideally, you'd find time to grok the patch and help me codify the design in some general kind of way. If there are general lessons to be learned here (and I suspect that there are), then this should not be left to chance. The same principles can probably be applied in heapam, for example. Even if I'm wrong about the techniques being highly generalizable, it should still be interesting! (Something to think about a little later.)

Some of the intuitions behind the design might be vaguely familiar to you as the reviewer of my earlier B-Tree work. In particular, the whole way that we reason about how successive related operations (in this case bottom-up deletion passes) affect individual leaf pages over time might give you a feeling of déjà vu. It's a little like the nbtsplitloc.c stuff that we worked on together during the Postgres 12 cycle. We can make what might seem like rather bold assumptions about what's going on, and adapt to the workload. This is okay because we're sure that the downside of our being wrong is a fixed, low performance penalty. (To a lesser degree it's okay because the empirical evidence shows that we're almost always right, because we apply the optimization again and again precisely because it worked out last time.)

I like to compare it to the familiar "rightmost leaf page applies leaf fillfactor" heuristic, which applies an assumption that is wrong in the general case, but nevertheless obviously helps enormously as a practical matter. Of course it's still true that anybody reviewing this patch ought to start with a principled skepticism of this claim -- that's how you review any patch. I can say for sure that that's the idea behind the patch, though.
I want to be clear about that up front, to save you time -- if this claim is wrong, then I'm wrong about everything.

Generational garbage collection influenced this work, too. It also applies pragmatic assumptions about where garbage is likely to appear -- assumptions that are based on nothing more than empirical observations about what is likely to happen in the real world, and that are validated by experience and by benchmarking.

> On 30/11/2020 21:50, Peter Geoghegan wrote:
> > +} TM_IndexDelete;
> > +} TM_IndexStatus;
>
> Is it really significantly faster to have two arrays? If you had just
> one array, each element would be only 12 bytes long. That's not much
> smaller than TM_IndexDeletes, which is 8 bytes.

Yeah, but the swap operations really matter here. At one point I found that to be the case, anyway. That might no longer be true, though. It might be that the code became less performance critical because other parts of the design improved, resulting in the code getting called less frequently. But if that is true then it has a lot to do with the power-of-two bucketing that you go on to ask about -- that helped performance a lot in certain specific cases (as I go into below). I will add a TODO item for myself to look into this again. You may well be right.

> > + /* First sort caller's array by TID */
> > + heap_tid_shellsort(delstate->deltids, delstate->ndeltids);
>
> Instead of sorting the array by TID, wouldn't it be enough to sort by
> just the block numbers?

I don't understand. Yeah, I guess that we could do our initial sort of the deltids array (the heap_tid_shellsort() call) just using BlockNumber (not TID). But OTOH there might be some small locality benefit to doing a proper TID sort at the level of each heap page. And even if there isn't any such benefit, does it really matter?
If you are asking about the later sort of the block counts array (which helps us sort the deltids array a second time, leaving it in its final order for bottom-up deletion heapam.c processing), then the answer is no. This block counts metadata array sort is useful because it allows us to leave the deltids array in what I believe to be the most useful order for processing. We'll access heap blocks primarily based on the number of promising tuples (though as I go into below, sometimes the number of promising tuples isn't a decisive influence on processing order).

> > + * While in general the presence of promising tuples (the hint that index
> > + * AMs provide) is the best information that we have to go on, it is based
> > + * on simple heuristics about duplicates in indexes that are understood to
> > + * have specific flaws. We should not let the most promising heap pages
> > + * win or lose on the basis of _relatively_ small differences in the total
> > + * number of promising tuples. Small differences between the most
> > + * promising few heap pages are effectively ignored by applying this
> > + * power-of-two bucketing scheme.
> > + *
>
> What are the "specific flaws"?

I just meant the obvious: the index AM doesn't actually know for sure that there are any old versions on the leaf page that it will ultimately be able to delete. This uncertainty needs to be managed, including inside heapam.c. Feel free to suggest a better way of expressing that sentiment.

> I understand that this is all based on rough heuristics, but is there
> any advantage to rounding/bucketing the numbers like this? Even if small
> differences in the total number of promising tuples is essentially noise
> that can be safely ignored, is there any harm in letting those small
> differences guide the choice? Wouldn't it be essentially the same as
> picking at random, or are those small differences somehow biased?

Excellent question!
It actually helps enormously, especially with low cardinality data that makes good use of the deduplication optimization (where it is especially important to keep the costs and the benefits in balance). This has been validated by benchmarking.

This design naturally allows the heapam.c code to take advantage of both temporal and spatial locality. For example, let's say that you have several indexes all on the same table, which get lots of non-HOT UPDATEs, which are skewed. Naturally, the heap TIDs will match across each index -- these are index entries that are needed to represent successor versions (which are logically unchanged/version duplicate index tuples from the point of view of nbtree/nbtdedup.c). Introducing a degree of determinism to the order in which heap blocks are processed naturally takes advantage of the correlated nature of the index bloat. It naturally makes it much more likely that the also-correlated bottom-up index deletion passes (that occur across indexes on the same table) each process the same heap blocks close together in time -- with obvious performance benefits.

In the extreme (but not particularly uncommon) case of non-HOT UPDATEs with many low cardinality indexes, each heapam.c call will end up doing *almost the same thing* across indexes. So we're making the correlated nature of the bloat (which is currently a big problem) work in our favor -- turning it on its head, you could say. Highly correlated bloat is not the exception -- it's actually the norm in the cases we're targeting here. If it wasn't highly correlated then it would already be okay to rely on VACUUM to get around to cleaning it later.

This power-of-two bucketing design probably also helps when there is only one index. I could go into more detail on that, plus other variations, but perhaps the multiple index example suffices for now. I believe that there are a few interesting kinds of correlations here -- I bet you can think of some yourself.
(Of course it's also possible and even likely that heap block correlation won't be important at all, but my response is "what specifically is the harm in being open to the possibility?".)

BTW, I tried to make the tableam interface changes more or less compatible with Zedstore, which is notable for not using TIDs in the same way as heapam (or zheap). Let me know what you think about that. I can go into detail about it if it isn't obvious to you.

--
Peter Geoghegan