Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Deleting older versions in unique indexes to avoid page splits
Date
Msg-id CAH2-Wz=rPkB5vXS7AZ+v8t3VE75v_dKGro+w4nWd64E9yiCLEQ@mail.gmail.com
In response to Re: Deleting older versions in unique indexes to avoid page splits  (Victor Yegorov <vyegorov@gmail.com>)
List pgsql-hackers
On Wed, Nov 25, 2020 at 1:20 PM Victor Yegorov <vyegorov@gmail.com> wrote:
> In the _bt_delete_or_dedup_one_page() we start with the simple loop over items on the page and
> if there're any LP_DEAD tuples, we're kicking off _bt_delitems_delete().

Right.

> So if I understood you right, you plan to make this loop (or a similar one somewhere around)
> to track TIDs of the LP_DEAD tuples and then (perhaps on a second loop over the page) compare all other
> currently-not-LP_DEAD tuples and mark those pages, that have at least 2 TIDs pointing at (one LP_DEAD and other not)
> as a promising one.

Yes. We notice extra TIDs that can be included in our heapam test "for
free". The cost is low, but the benefits are also often quite high: in
practice there are *natural* correlations that we can exploit.

For example: maybe there were non-HOT updates, and some but not all of
the versions got marked LP_DEAD. We can get them all in one go,
avoiding a true bottom-up index deletion pass for much longer
(compared to doing LP_DEAD deletion the old way, which is what happens
in v9 of the patch). We're better off doing the deletions all at once.
It's cheaper.
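
To make that concrete, here is a rough sketch of the two passes over
the leaf page (illustration only, not the actual patch code): page and
opaque are assumed to be set up by the caller, posting list tuples are
ignored for simplicity, and SeenBlocks plus the note_deletable(),
remember_heap_block(), heap_block_was_seen() and note_extra_candidate()
helpers are made up for the example.

static void
sketch_collect_candidates(Page page, BTPageOpaque opaque, SeenBlocks *blocks)
{
    OffsetNumber minoff = P_FIRSTDATAKEY(opaque);
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    /* First pass: items already marked LP_DEAD, and their heap blocks */
    for (OffsetNumber off = minoff; off <= maxoff; off = OffsetNumberNext(off))
    {
        ItemId      itemid = PageGetItemId(page, off);
        IndexTuple  itup = (IndexTuple) PageGetItem(page, itemid);

        if (ItemIdIsDead(itemid))
        {
            note_deletable(off, &itup->t_tid);
            remember_heap_block(blocks,
                                ItemPointerGetBlockNumber(&itup->t_tid));
        }
    }

    /*
     * Second pass: a TID that isn't LP_DEAD but points into one of those
     * same heap blocks can be checked "for free" when we visit the block.
     */
    for (OffsetNumber off = minoff; off <= maxoff; off = OffsetNumberNext(off))
    {
        ItemId      itemid = PageGetItemId(page, off);
        IndexTuple  itup = (IndexTuple) PageGetItem(page, itemid);

        if (!ItemIdIsDead(itemid) &&
            heap_block_was_seen(blocks,
                                ItemPointerGetBlockNumber(&itup->t_tid)))
            note_extra_candidate(off, &itup->t_tid);
    }
}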

(We still really need to have bottom-up deletion passes, of course,
because they cover the case where there are no LP_DEAD bits set at
all -- handling that case is an important goal of this project.)

Minor note: Technically there aren't any promising tuples involved,
because that concept only makes sense when we are not going to visit
every possible heap page (just the "most promising" heap pages). But we are
going to visit every possible heap page with the new LP_DEAD bit
deletion code (which could occasionally mean visiting 10 or more heap
pages, which is a lot more than bottom-up index deletion will ever
visit). All we need to do with the new LP_DEAD deletion logic is to
include all possible matching TIDs (not just those that are marked
LP_DEAD already).
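
As a rough illustration of that difference (the helper functions, the
MAX_PROMISING_BLOCKS cap and the bottomup flag are all made up here,
and this is not the real control flow):

static void
sketch_choose_heap_blocks(BlockNumber *deadblocks, int ndeadblocks,
                          BlockNumber *candblocks, int ncandblocks,
                          bool bottomup)
{
    if (!bottomup)
    {
        /*
         * New LP_DEAD deletion: visit every heap block that already has
         * at least one known-dead TID pointing at it (occasionally 10 or
         * more blocks), testing all matching TIDs while we're there.
         */
        for (int i = 0; i < ndeadblocks; i++)
            test_block_for_deletable_tids(deadblocks[i]);
    }
    else
    {
        /*
         * Bottom-up deletion: order candidate blocks by how "promising"
         * they look, and give up after a small, fixed number of heap
         * page accesses.
         */
        sort_blocks_by_promise(candblocks, ncandblocks);
        for (int i = 0; i < Min(ncandblocks, MAX_PROMISING_BLOCKS); i++)
            test_block_for_deletable_tids(candblocks[i]);
    }
}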

> What I meant to ask — will LP_DEAD be set by IndexScan in the presence of the long transaction?

That works in the same way as before, even with the new LP_DEAD
deletion code. The new code uses the same information as before (set
LP_DEAD bits), which is generated in the same way as before. The
difference is in how the information is actually used during LP_DEAD
deletion -- we can now delete some extra things in certain common
cases.

In practice this (and bottom-up deletion) makes nbtree more robust
against disruption caused by long-running transactions that hold a
snapshot open. It's hard to give a simple explanation of why that is,
because it's a second-order effect. The patch is going to make it
possible to recover when LP_DEAD bits suddenly stop being set because
of an old snapshot -- now we'll have a "second chance", and maybe even
a third chance. But if the snapshot is held open *forever*, then a
second chance has no value.

Here is a thought experiment that might be helpful:

Imagine Postgres just as it is today (without the patch), except that
VACUUM runs very frequently, and is infinitely fast (this is a magical
version of VACUUM). This solves many problems, but does not solve all
problems. Magic Postgres will become just as slow as earthly Postgres
when there is a snapshot that is held open for a very long time. That
will take longer to happen compared to earthly/mortal Postgres, but
eventually there will be no difference between the two at all. But,
when you don't have such an extreme problem, magic Postgres really is
much faster.

I think that it will be possible to approximate the behavior of magic
Postgres using techniques like bottom-up deletion, the new LP_DEAD
deletion mechanism we've been talking about today, and maybe other
enhancements in other areas (like heap pruning). It doesn't matter
that we don't physically remove garbage immediately, as long as we
"logically" remove it immediately. The actual physical removal can
occur in a just-in-time, incremental fashion, creating the illusion that VACUUM
really does run infinitely fast. No magic required.

Actually, in a way this isn't new; we have always "logically" removed
garbage at the earliest opportunity (by which I mean we allow it to be
physically removed according to an oldestXmin-style cutoff, which can
be reacquired/updated the second the oldest MVCC snapshot
goes away). We don't think of useless old versions as being "logically
removed" the instant an old snapshot goes away. But maybe we should --
it's a useful mental model.
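
A tiny sketch of that mental model (hypothetical helper, not anything
that exists in the tree):

/*
 * A superseded version is "logically removed" as soon as its deleting
 * XID (assumed committed here) precedes every live snapshot's xmin --
 * an oldestXmin-style cutoff.  Physical removal by pruning, index
 * deletion or VACUUM can then happen whenever it's convenient.
 */
static bool
version_is_logically_removed(TransactionId xmax, TransactionId oldest_xmin)
{
    return TransactionIdIsValid(xmax) &&
           TransactionIdPrecedes(xmax, oldest_xmin);
}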

It will also be very helpful to add "remove useless intermediate
versions" logic at some point. This is quite a distinct area to what I
just described, but it's also important. We need both, I think.

--
Peter Geoghegan


