Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Deleting older versions in unique indexes to avoid page splits
Date
Msg-id CAH2-WzkU_3JqX0KZsffJf+bXQAhm87BHvLHVKdDtfJteGVs8Yg@mail.gmail.com
In response to Re: Deleting older versions in unique indexes to avoid page splits  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Deleting older versions in unique indexes to avoid page splits  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Sat, Jan 16, 2021 at 9:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Do we do this optimization (bottom-up deletion) only when the last
> item which can lead to page split has indexUnchanged set to true? If
> so, what if the last tuple doesn't have indexUnchanged but the
> previous ones do have?

Using the indexUnchanged hint as a triggering condition makes sure
that non-HOT updaters are the ones that pay the cost of a bottom-up
deletion pass. We create backpressure for non-HOT updaters (in indexes
that are logically unchanged) specifically. (Of course it's also true
that the presence of the indexUnchanged hint strongly suggests that
there are already version churn duplicates on the leaf page, though
that is never certain.)

It's possible that there will be some mixture of inserts and non-HOT
updates on the same leaf page, and as you say the implementation might
fail to do a bottom-up pass based on the fact that an incoming item at
the point of a would-be page split was a plain insert (and so didn't
receive the hint). These kinds of "missed opportunities" are okay
because a page split was inevitable either way. You can imagine a case
where that isn't true (i.e. a missed opportunity to avoid a page
split), but that's kind of like imagining a fair coin flip coming up
heads 100 times in a row. It just isn't realistic for such a "mixed
tendencies" leaf page to stay in flux (i.e. not ever split) forever,
even with the smartest triggering condition in the world -- it's too
unstable.
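
To make the triggering condition concrete, here's a minimal
standalone sketch of the decision order I have in mind when a leaf
page can't fit an incoming tuple (the names are made up for
illustration -- this is not the actual nbtinsert.c code):

/*
 * Illustrative sketch only: hypothetical names, standalone C.  The
 * point is that the indexUnchanged hint gates the bottom-up pass,
 * and that a "missed" pass just means falling through to the next
 * option.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct LeafPageState
{
    bool    has_lp_dead_items;   /* items already marked LP_DEAD */
    bool    bottomup_would_help; /* unknowable in advance; simulated */
    bool    dedup_would_help;
} LeafPageState;

static const char *
handle_full_leaf_page(const LeafPageState *page, bool indexUnchanged)
{
    /* Cheapest option first: delete items already marked LP_DEAD */
    if (page->has_lp_dead_items)
        return "simple deletion";

    /*
     * Only non-HOT updaters (indexUnchanged == true) pay for the more
     * speculative bottom-up deletion pass.  A plain insert arriving
     * here skips it, even if earlier non-HOT updates left garbage
     * behind -- the "missed opportunity" case.
     */
    if (indexUnchanged && page->bottomup_would_help)
        return "bottom-up deletion";

    /* Last resort before accepting a split: merge duplicates */
    if (page->dedup_would_help)
        return "deduplication";

    return "page split";
}

int
main(void)
{
    LeafPageState page = {false, true, true};

    printf("non-HOT update: %s\n", handle_full_leaf_page(&page, true));
    printf("plain insert:   %s\n", handle_full_leaf_page(&page, false));
    return 0;
}

The point is just that a plain insert never pays for the speculative
pass; at worst it falls through to deduplication, or to the split we
were going to do anyway.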

An important part of understanding the design is to imagine things at
the leaf page level, and to think about how the page's actual physical
state differs from an idealized physical representation of the same
leaf page -- one that stays much closer to the logical contents of the
database.
We're only interested in leaf pages where the number of logical rows
is basically fixed over time (when there is version churn). Usually
this just means that there are lots of non-HOT updates, but it can
also work with lots of queue-like inserts and deletes in a unique
index.

Ultimately, what determines whether the bottom-up deletion
optimization is effective for any given leaf page is whether it
actually prevents the page from splitting despite lots of version
churn -- and whether it keeps doing so, again and again. Another key concept
here is that bottom-up deletion passes for the same leaf page are
related in important ways -- it is often a mistake to think of
individual bottom-up passes as independent events.

> Why are we using terminology bottom-up deletion and why not simply
> duplicate version deletion or something on those lines?

Why is that simpler? Also, it doesn't exactly delete duplicates. See
my later remarks.

> How do we distinguish between version duplicate tuples (added due to
> the reason that some other index column is updated) or duplicate
> tuples (added because a user has inserted a row with duplicate value)
> for the purpose of bottom-up deletion?  I think we need to distinguish
> between them because in the earlier case we can remove the old version
> of the tuple in the index but not in later. Or is it the case that
> indexAm doesn't differentiate between them but relies on heapAM to
> give the correct answer?

Bottom-up deletion uses the presence of duplicates as a starting point
for determining which heap pages to visit, based on the assumption
that at least some are obsolete versions. But heapam.c has additional
heap-level heuristics that help too.
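
To illustrate that starting point, here's a toy standalone sketch
(made-up structures and names, not the actual nbtree code): TIDs
whose index keys are duplicated on the leaf page get flagged as
"promising" heap-visit candidates, on the assumption that at least
some of them point to obsolete versions:

#include <stdbool.h>
#include <stdio.h>

typedef struct IndexEntry
{
    int          key;        /* indexed value */
    unsigned int heapblock;  /* heap block its TID points at */
    bool         promising;  /* set below */
} IndexEntry;

int
main(void)
{
    /* leaf page contents, already in key order */
    IndexEntry  entries[] = {
        {100, 3, false}, {100, 7, false}, {100, 7, false},
        {101, 4, false}, {102, 7, false}, {102, 9, false},
    };
    int         n = (int) (sizeof(entries) / sizeof(entries[0]));

    for (int i = 0; i < n; i++)
    {
        /* a duplicate is any entry whose neighbor shares its key */
        bool        dup =
            (i > 0 && entries[i - 1].key == entries[i].key) ||
            (i < n - 1 && entries[i + 1].key == entries[i].key);

        entries[i].promising = dup;
        printf("key %d -> heap block %u%s\n", entries[i].key,
               entries[i].heapblock, dup ? " (promising)" : "");
    }
    return 0;
}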

It is quite possible and even likely that we will delete some
non-duplicate tuples in passing, simply because they get checked along
the way -- they might turn out to be deletable, for whatever reason.
We're also concerned (on the heapam.c side) about which heap pages
have the most TIDs (any kind of TID, not just those marked promising
by the index AM), so this kind of "serendipity" is quite common in
practice. Often the total number of heap pages pointed to by all of
the index tuples on the leaf page just isn't that large (8-10). And
cases with lots of HOT pruning can often have hundreds of LP_DEAD item
pointers on a heap page, which we'll tend to get to before too long
anyway (with or without many duplicates).
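
As a rough illustration of the heap-block side of this (my own
simplified names and numbers, not the heapam.c code), the idea is to
tally how many of the leaf page's TIDs land on each heap block and
then spend a small visit budget on the few most-referenced blocks:

#include <stdio.h>
#include <stdlib.h>

typedef struct BlockCount
{
    unsigned int block;
    int          ntids;      /* TIDs from this leaf page on that block */
} BlockCount;

static int
cmp_ntids_desc(const void *a, const void *b)
{
    return ((const BlockCount *) b)->ntids -
           ((const BlockCount *) a)->ntids;
}

int
main(void)
{
    /* heap block number for each of the leaf page's index tuples */
    unsigned int tid_blocks[] = {7, 7, 7, 12, 12, 3, 7, 12, 19, 7};
    int          ntids = (int) (sizeof(tid_blocks) / sizeof(tid_blocks[0]));
    BlockCount   counts[32];
    int          nblocks = 0;

    /* tally how many TIDs fall on each distinct heap block */
    for (int i = 0; i < ntids; i++)
    {
        int         j;

        for (j = 0; j < nblocks; j++)
        {
            if (counts[j].block == tid_blocks[i])
                break;
        }
        if (j == nblocks)
        {
            counts[nblocks].block = tid_blocks[i];
            counts[nblocks].ntids = 0;
            nblocks++;
        }
        counts[j].ntids++;
    }

    qsort(counts, nblocks, sizeof(BlockCount), cmp_ntids_desc);

    /* visit only a handful of the most-referenced heap blocks */
    for (int i = 0; i < nblocks && i < 3; i++)
        printf("visit heap block %u (%d TIDs from this leaf page)\n",
               counts[i].block, counts[i].ntids);
    return 0;
}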

The nbtree/caller side makes inferences about what is likely to be
true about the "logical contents" of the physical leaf page, as a
starting point for the heuristic-driven search for deletable index
tuples. There are various ways in which each inference (considered
individually) might be wrong, including the one that you pointed out:
inserts of duplicates will look like update version churn. But that
particular issue won't matter if the inserted duplicates are on
mostly-distinct heap blocks (which is likely), because then they'll
only make a noise-level contribution to the behavior in heapam.c.
Also, we can fall back on deduplication when bottom-up deletion fails,
which will merge together the duplicates-from-insert, so now any
future bottom-up deletion pass over the same leaf page won't have the
same problem.
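
Here's a minimal sketch of what the deduplication fallback buys us in
that scenario (illustrative only, not the real nbtdedup.c code):
adjacent index tuples with equal keys get merged into a single
posting-list-style entry carrying several TIDs, which frees leaf page
space without deleting anything:

#include <stdio.h>

#define MAX_TIDS 8

typedef struct PostingEntry
{
    int          key;
    int          ntids;
    unsigned int tids[MAX_TIDS];    /* stand-in for real item pointers */
} PostingEntry;

int
main(void)
{
    int          keys[] = {100, 100, 100, 101, 102, 102};
    unsigned int tids[] = {11, 12, 13, 14, 15, 16};
    int          n = 6;
    PostingEntry merged[6];
    int          nmerged = 0;

    for (int i = 0; i < n; i++)
    {
        if (nmerged > 0 && merged[nmerged - 1].key == keys[i] &&
            merged[nmerged - 1].ntids < MAX_TIDS)
        {
            PostingEntry *e = &merged[nmerged - 1];

            e->tids[e->ntids++] = tids[i];  /* absorb the duplicate */
        }
        else
        {
            merged[nmerged].key = keys[i];
            merged[nmerged].ntids = 1;
            merged[nmerged].tids[0] = tids[i];
            nmerged++;
        }
    }

    printf("%d index tuples merged into %d posting-list entries\n",
           n, nmerged);
    return 0;
}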

Bear in mind that heapam.c is only looking for a select few heap
blocks, and doesn't even need to get the order exactly right. We're
only concerned about extremes, which are actually what we see in cases
that we're interested in helping. We only have to be very
approximately right, or right on average. Sure, there might be some
tiny cost in marginal cases, but that's not something that I've been
able to quantify in any practical way -- because once we fail, we fail
for good: the page splits, and the question of doing a bottom-up
deletion pass for that same leaf page ends.

Another important concept is the cost asymmetry -- the asymmetry here
is huge. Leaf page splits driven by version churn are very expensive
in the short term and in the long term. Our previous behavior was to
assume that they were necessary. Now we're initially assuming that
they're unnecessary, and requiring non-HOT updaters to do some work
that shows (with some margin of error) that such a split is in fact
necessary. This is a form of backpressure.

Bottom-up deletion doesn't intervene unless and until a page split
looks imminent. It is only interested in pages where version churn is
concentrated -- it is absolutely fine to leave it up to VACUUM to get
to any "floating garbage" tuples later. Concentrated version churn is
a pathological condition, and as such isn't hard to spot, regardless
of the workload.

If you or anyone else can think of a gap in my reasoning, or a
workload in which the heuristics either fail to prevent page splits
where that might be expected or impose too high a cost, do let me
know. I admit that my design is unorthodox, but the results speak for
themselves.

-- 
Peter Geoghegan


