Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Re: Deleting older versions in unique indexes to avoid page splits
Date |
Msg-id | CAH2-WzkU_3JqX0KZsffJf+bXQAhm87BHvLHVKdDtfJteGVs8Yg@mail.gmail.com
In response to | Re: Deleting older versions in unique indexes to avoid page splits (Amit Kapila <amit.kapila16@gmail.com>)
Responses | Re: Deleting older versions in unique indexes to avoid page splits
List | pgsql-hackers
On Sat, Jan 16, 2021 at 9:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Do we do this optimization (bottom-up deletion) only when the last
> item which can lead to page split has indexUnchanged set to true? If
> so, what if the last tuple doesn't have indexUnchanged but the
> previous ones do have?

Using the indexUnchanged hint as a triggering condition makes sure that non-HOT updaters are the ones that pay the cost of a bottom-up deletion pass. We create backpressure for non-HOT updaters (in indexes that are logically unchanged) specifically. (Of course it's also true that the presence of the indexUnchanged hint is highly suggestive of there already being more version-churn duplicates on the leaf page, though that is not actually certain.)

It's possible that there will be some mixture of inserts and non-HOT updates on the same leaf page, and as you say the implementation might fail to do a bottom-up pass because the incoming item at the point of a would-be page split was a plain insert (and so didn't receive the hint). These kinds of "missed opportunities" are okay because a page split was inevitable either way. You can imagine a case where that isn't true (i.e. a genuinely missed opportunity to avoid a page split), but that's kind of like imagining a fair coin flip coming up heads 100 times in a row. It just isn't realistic for such a "mixed tendencies" leaf page to stay in flux (i.e. not ever split) forever, even with the smartest triggering condition in the world -- it's too unstable.

An important part of understanding the design is to imagine things at the leaf page level, and to think about how what the page actually looks like differs from an ideal physical representation of the same leaf page -- one that is much closer to the logical contents of the database. We're only interested in leaf pages where the number of logical rows is basically fixed over time (when there is version churn). Usually this just means that there are lots of non-HOT updates, but it can also work with lots of queue-like inserts and deletes in a unique index.

Ultimately, the thing that determines whether or not the bottom-up deletion optimization is effective for any given leaf page is whether it actually prevents the page from splitting despite lots of version churn -- something that has to happen again and again. Another key concept here is that bottom-up deletion passes for the same leaf page are related in important ways -- it is often a mistake to think of individual bottom-up passes as independent events.

> Why are we using terminology bottom-up deletion and why not simply
> duplicate version deletion or something on those lines?

Why is that simpler? Also, it doesn't exactly delete duplicates. See my later remarks.

> How do we distinguish between version duplicate tuples (added due to
> the reason that some other index column is updated) or duplicate
> tuples (added because a user has inserted a row with duplicate value)
> for the purpose of bottom-up deletion? I think we need to distinguish
> between them because in the earlier case we can remove the old version
> of the tuple in the index but not in later. Or is it the case that
> indexAm doesn't differentiate between them but relies on heapAM to
> give the correct answer?

Bottom-up deletion uses the presence of duplicates as a starting point for determining which heap pages to visit, based on the assumption that at least some of them point to obsolete versions. But heapam.c has additional heap-level heuristics that help too.
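To make the triggering condition described above concrete, here is a minimal illustrative sketch in C. It is not the actual nbtree code -- the type, enum, and function names are hypothetical, and the real fallback chain has many more details -- but it shows the basic idea: when a leaf page has no room, only an incoming tuple that carries the indexUnchanged hint causes a bottom-up deletion pass to be attempted, before any fallback to deduplication and, ultimately, a page split.

    /*
     * Illustrative sketch only (hypothetical names, not the actual nbtree
     * code).  An incoming tuple arrives at a leaf page that has no room.
     * The executor sets index_unchanged for non-HOT updates that leave this
     * index's key columns unchanged; only then is a bottom-up deletion pass
     * attempted, so non-HOT updaters are the ones that pay for it.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct IncomingTuple
    {
        bool        index_unchanged;    /* executor hint for this index */
    } IncomingTuple;

    typedef struct LeafPageState
    {
        bool        has_free_space;     /* new tuple fits without any work */
        bool        has_lp_dead_items;  /* known-dead items to reclaim first */
    } LeafPageState;

    typedef enum SplitAvoidanceStep
    {
        STEP_JUST_INSERT,               /* page has room, nothing to do */
        STEP_SIMPLE_DELETION,           /* reclaim LP_DEAD items */
        STEP_BOTTOMUP_DELETION,         /* triggered by the hint */
        STEP_DEDUPLICATION,             /* merge duplicates as a fallback */
        STEP_PAGE_SPLIT                 /* last resort if all else fails */
    } SplitAvoidanceStep;

    /*
     * First split-avoidance step to try; later steps (up to and including
     * STEP_PAGE_SPLIT) are reached only if earlier ones don't free space.
     */
    static SplitAvoidanceStep
    first_step_to_try(const LeafPageState *page, const IncomingTuple *itup)
    {
        if (page->has_free_space)
            return STEP_JUST_INSERT;
        if (page->has_lp_dead_items)
            return STEP_SIMPLE_DELETION;
        if (itup->index_unchanged)
            return STEP_BOTTOMUP_DELETION;  /* non-HOT updater pays the cost */
        return STEP_DEDUPLICATION;          /* plain inserts skip the pass */
    }

    int
    main(void)
    {
        LeafPageState full_page = { .has_free_space = false,
                                    .has_lp_dead_items = false };
        IncomingTuple non_hot_update = { .index_unchanged = true };
        IncomingTuple plain_insert = { .index_unchanged = false };

        printf("non-HOT update -> step %d\n",
               first_step_to_try(&full_page, &non_hot_update));
        printf("plain insert   -> step %d\n",
               first_step_to_try(&full_page, &plain_insert));
        return 0;
    }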
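Along the same lines, here is a second minimal sketch (again with hypothetical names, not the actual heapam.c code) of the kind of heap-block-ranking heuristic being described: tally how many of the leaf page's TIDs point at each heap block, prefer blocks that also have "promising" duplicate-derived TIDs, and then visit only a handful of the highest-ranked blocks.

    /*
     * Illustrative sketch only (hypothetical names, not the actual heapam.c
     * code).  Rank candidate heap blocks by how many of the leaf page's TIDs
     * point at them, preferring blocks with "promising" (duplicate-derived)
     * TIDs, and visit only a handful of the best ones.  Any TID counts, not
     * just promising ones, which is why non-duplicate tuples sometimes get
     * deleted "in passing".
     */
    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned int BlockNumber;

    typedef struct CandidateBlock
    {
        BlockNumber block;          /* heap block number */
        int         ntids;          /* all TIDs pointing at this block */
        int         npromising;     /* TIDs the index AM marked promising */
    } CandidateBlock;

    #define MAX_HEAP_VISITS 10      /* never visit more than a few blocks */

    /* Order blocks by promising TIDs first, then by total TIDs */
    static int
    compare_candidates(const void *a, const void *b)
    {
        const CandidateBlock *ca = (const CandidateBlock *) a;
        const CandidateBlock *cb = (const CandidateBlock *) b;

        if (ca->npromising != cb->npromising)
            return cb->npromising - ca->npromising;
        return cb->ntids - ca->ntids;
    }

    /* Sort the candidates and decide how many of them are worth visiting */
    static int
    choose_heap_blocks(CandidateBlock *candidates, int ncandidates)
    {
        qsort(candidates, ncandidates, sizeof(CandidateBlock),
              compare_candidates);
        return ncandidates < MAX_HEAP_VISITS ? ncandidates : MAX_HEAP_VISITS;
    }

    int
    main(void)
    {
        /* TID tallies gathered from one leaf page's index tuples */
        CandidateBlock candidates[] = {
            { .block = 17, .ntids = 3, .npromising = 0 },
            { .block = 42, .ntids = 9, .npromising = 6 },
            { .block = 5,  .ntids = 7, .npromising = 1 },
        };
        int         nvisit = choose_heap_blocks(candidates, 3);

        for (int i = 0; i < nvisit; i++)
            printf("visit heap block %u (%d TIDs, %d promising)\n",
                   candidates[i].block, candidates[i].ntids,
                   candidates[i].npromising);
        return 0;
    }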
It is quite possible and even likely that we will delete some non-duplicate tuples in passing, just because they're checked in passing -- they might turn out to be deletable, for whatever reason. We're also concerned (on the heapam.c side) with which heap pages have the most TIDs (any kind of TID, not just the ones marked promising by the index AM), so this kind of "serendipity" is quite common in practice. Often the total number of heap pages that are pointed to by all the index tuples on the leaf page just isn't that many (8 - 10). And cases with lots of HOT pruning can often have hundreds of LP_DEAD item pointers on a heap page, which we'll tend to get to before too long anyway (with or without many duplicates).

The nbtree/caller side makes inferences about what is likely to be true about the "logical contents" of the physical leaf page, as a starting point for the heuristic-driven search for deletable index tuples. There are various ways in which each inference (considered individually) might be wrong, including the one that you pointed out: inserts of duplicates will look like update version churn. But that particular issue won't matter if the inserted duplicates are on mostly-distinct heap blocks (which is likely), because then they'll only make a noise-level contribution to the behavior in heapam.c. Also, we can fall back on deduplication when bottom-up deletion fails, which will merge together the duplicates-from-inserts, so any future bottom-up deletion pass over the same leaf page won't have the same problem.

Bear in mind that heapam.c is only looking for a select few heap blocks, and doesn't even need to get the order exactly right. We're only concerned about extremes, which are exactly what we see in the cases that we're interested in helping. We only have to be very approximately right, or right on average. Sure, there might be some tiny cost in marginal cases, but that's not something that I've been able to quantify in any practical way -- because once we fail, we fail for good: the page splits, and the question of doing a bottom-up deletion pass for that same leaf page ends.

Another important concept is the cost asymmetry -- the asymmetry here is huge. Leaf page splits driven by version churn are very expensive, in both the short term and the long term. Our previous behavior was to assume that they were necessary. Now we initially assume that they're unnecessary, and require non-HOT updaters to do some work that shows (with some margin of error) that such a split is in fact necessary. This is a form of backpressure.

Bottom-up deletion doesn't intervene unless and until the page would otherwise have to split. It is only interested in pages where version churn is concentrated -- it is absolutely fine to leave it up to VACUUM to get to any "floating garbage" tuples later. Concentrated version churn is a pathological condition, and as such isn't hard to spot, regardless of workload conditions.

If you or anyone else can think of a gap in my reasoning, or a workload in which the heuristics either fail to prevent page splits where that might be expected or impose too high a cost, do let me know. I admit that my design is unorthodox, but the results speak for themselves.

--
Peter Geoghegan