heapam and bottom-up garbage collection, keeping version chains short (Was: Deleting older versions in unique indexes to avoid page splits) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject heapam and bottom-up garbage collection, keeping version chains short (Was: Deleting older versions in unique indexes to avoid page splits)
Msg-id CAH2-Wznu00SnGSw-idMj9DLy59XJxSLWvshRCO5sGC--eaguMQ@mail.gmail.com
In response to Re: Deleting older versions in unique indexes to avoid page splits  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Oct 22, 2020 at 6:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But that being said, I'm not trying to derail this patch. It isn't,
> and shouldn't be, the job of this patch to solve that problem. It's
> just better if it doesn't regress things, or maybe even (as you say)
> makes them a little better. I think the idea you've got here is
> basically good, and a lot of it comes down to how well it works in
> practice.

Thanks. I would like to have a more general conversation about how
some of the principles embodied by this patch could be generalized to
heapam in the future. I think that we could do something similar with
pruning, or at least with the truncation of HOT chains. Here are a few
of the principles I'm thinking of, plus some details of how they might
be applied in heapam:

* The most important metric is not the total number of dead tuples in
the whole table. Rather, it's something like the 90th and 99th
percentile version chain length for each logical row, or maybe that
measured only over the logical rows actually accessed by index scans
recently. (Not saying that we should try to target this in some
explicit way - just that it should be kept low with real workloads as
a natural consequence of the design.)

* Have a variety of strategies for dealing with pruning that compete
with each other. Allow them to fight it out organically. Start with
the cheapest thing and work your way up. So for example, truncation of
HOT chains is always preferred, and works in just the same way as it
does today. Then it gets more complicated and more expensive, but in a
way that is a natural adjunct of the current proven design.

I believe that this creates a bit more pain earlier on, but avoids
much more pain later. There is no point powering ahead if we'll end up
hopelessly indebted with garbage tuples in the end.

For heapam we could escalate from regular pruning by using an old
version store for versions that were originally part of HOT chains,
but were found to not fit on the same page at the point of
opportunistic pruning. The old version store is for committed versions
only -- it isn't really much like UNDO. We leave forwarding
information on the original heap page.

We add old committed versions to the version store at the point when a
heap page is about to experience an event that is roughly the
equivalent of a B-Tree page split -- for heapam a "split" is being
unable to keep the same logical row on the same heap page over time in
the presence of many updates. This is our last line of defense against
this so-called page split situation. It occurs after regular pruning
(which is an earlier line of defense) fails to resolve the situation
inexpensively. We can make moving tuples into the old version store
WAL-efficient by making it work like an actual B-Tree page split -- it
can describe the changes in a way that's mostly logical, and based on
the existing page image. And like an actual page split, we're
amortizing costs in a localized way.
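
As a sketch of the control flow only (standalone C, the function
names are stand-ins I just made up), the ordering of defenses when a
page is about to "split" would look something like this:

#include <stdbool.h>
#include <stdio.h>

typedef struct HeapPageState
{
    int         free_space;     /* stand-in for the real page state */
} HeapPageState;

/* stand-ins for the two lines of defense discussed above */
static bool
try_regular_prune(HeapPageState *page)
{
    return page->free_space > 512;  /* pretend pruning freed enough room */
}

static bool
try_move_chains_to_version_store(HeapPageState *page)
{
    return page->free_space > 64;   /* the last line of defense */
}

/* returns true if the incoming successor version can stay on this page */
static bool
avoid_heap_page_split(HeapPageState *page)
{
    if (try_regular_prune(page))
        return true;
    if (try_move_chains_to_version_store(page))
        return true;
    return false;               /* accept defeat; relocate the successor */
}

int
main(void)
{
    HeapPageState page = {.free_space = 100};

    printf("successor stays on page: %d\n", avoid_heap_page_split(&page));
    return 0;
}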

It is also possible to apply differential compression to whole HOT
chains rather than storing them in the old version store, if that
happens to look favorable (think of rows with something like a
pgbench_accounts.filler column -- not uncommon). We have options, and
we can add new ones in the future as new requirements come to light.
We allow the best option to present itself to us in a bottom-up
fashion. Sometimes the best strategy at the local level is actually a
combination of two different strategies applied alternately over time.
For example, we use differential compression first when we fail to
prune, then we prune the same page later (a little like merge
deduplication in my recent nbtree delete dedup patch).
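
To show what I mean by differential compression, here is a minimal
standalone sketch of one obvious approach: store only the bytes of a
version that differ from a neighboring version, plus shared
prefix/suffix lengths. (The WAL code already does something in this
spirit for UPDATE records; none of this is actual server code.)

#include <stdio.h>
#include <string.h>

typedef struct VersionDelta
{
    size_t      prefixlen;      /* bytes shared with the base version's start */
    size_t      suffixlen;      /* bytes shared with the base version's end */
    size_t      midlen;         /* length of the differing middle section */
    const char *mid;            /* the differing bytes themselves */
} VersionDelta;

static VersionDelta
delta_encode(const char *base, size_t baselen, const char *vers, size_t verslen)
{
    VersionDelta d = {0, 0, 0, NULL};

    while (d.prefixlen < baselen && d.prefixlen < verslen &&
           base[d.prefixlen] == vers[d.prefixlen])
        d.prefixlen++;
    while (d.suffixlen < baselen - d.prefixlen &&
           d.suffixlen < verslen - d.prefixlen &&
           base[baselen - 1 - d.suffixlen] == vers[verslen - 1 - d.suffixlen])
        d.suffixlen++;
    d.midlen = verslen - d.prefixlen - d.suffixlen;
    d.mid = vers + d.prefixlen;
    return d;
}

int
main(void)
{
    /* think of a row with a wide, rarely changing filler column */
    const char *v1 = "1|100|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
    const char *v2 = "1|150|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
    VersionDelta d = delta_encode(v1, strlen(v1), v2, strlen(v2));

    printf("store %zu bytes instead of %zu (prefix %zu, suffix %zu)\n",
           d.midlen, strlen(v2), d.prefixlen, d.suffixlen);
    return 0;
}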

Maybe each of these two strategies (differential compression +
traditional heap HOT chain truncation) get applied alternately against
the same heap page over time, in a tick-tock fashion. We naturally
avoid resorting to the old version store structure, which is good,
since that is a relatively expensive strategy that we should apply
only as a last resort. This tick-tock behavior is an emergent property
of the workload rather than something planned or intelligent, and yet
it kind of appears to be an intelligent strategy. (Maybe it works that
way permanently in some parts of the heap, or maybe the same heap
blocks only tick-tock like this on Tuesdays. It may be possible for
stuff like that to happen sanely with well-chosen, simple heuristics
that exploit asymmetry.)

* Work with (not against) the way that Postgres decouples the
physical representation of data from the logical contents of the
database more strongly than other DB systems do. But at the same
time, work hard
to make the physical representation of the data as close as is
practically possible to an idealized, imaginary logical version of the
data. Do this because it makes queries faster, not because it's
strictly necessary for concurrency control/recovery/whatever.

Concretely, this mostly means that we try our best to keep each
logical row (i.e. the latest physical version or two of each row)
located on the same physical heap block over time, using the
escalation strategy I described or something like it. Notice that
we're imposing a cost on queries that are arguably responsible for
creating garbage, but only when we really can't tolerate more garbage
collection debt. But if it doesn't work out that's okay -- we tried. I
have a feeling that it will in fact mostly work out. Why shouldn't it
be possible to have no more than one or two uncommitted row versions
in a heap page at any given time, just like with my nbtree patch? (I
think that the answer to that question is "weird workloads", but I'm
okay with the idea that they're a little slower or whatever.)

Notice that this makes the visibility map work better in practice. I
also think that the FSM needs to be taught that it isn't a good thing
to reuse a little fragment of space on its own, because it works
against our goal of trying to avoid relocating rows. The current
logic seems focused on using every little scrap of free space no
matter what, which is pretty misguided. Penny-wise, pound-foolish.
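
Just to illustrate the kind of FSM policy I mean (the threshold and
the function are invented for the example, and this is emphatically
not how the FSM behaves today):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE           8192
#define MIN_REUSE_FRACTION  0.20    /* arbitrary illustrative threshold */

static bool
page_worth_advertising(int free_bytes, int typical_row_width)
{
    /* small scraps: leave them for future HOT updates of resident rows */
    if (free_bytes < (int) (PAGE_SIZE * MIN_REUSE_FRACTION))
        return false;
    /* also require room for at least a handful of typical rows */
    return free_bytes >= 4 * typical_row_width;
}

int
main(void)
{
    printf("%d\n", page_worth_advertising(300, 120));  /* 0: keep scrap local */
    printf("%d\n", page_worth_advertising(4000, 120)); /* 1: genuinely free */
    return 0;
}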

Also notice that fighting to keep the same logical row on the same
block has a non-linear payoff. We don't need to give up on that goal
at the first sign of trouble. If it's hard to do a conventional prune
after succeeding a thousand times then it's okay to work harder. Only
a sucker gives up at the first sign of trouble. We're playing a long
game here. If it becomes so hard that even applying a more aggressive
strategy fails, then it's probably also true that it has become
inherently impossible to sustain the original layout. We graciously
accept defeat and cut our losses, having only actually wasted a little
effort to learn that we need to move our incoming successor version to
some other heap block (especially because the cost is amortized across
versions that live on the same heap block).

* Don't try to completely remove VACUUM. Treat its top-down approach
as complementary to the new bottom-up approaches we add.

There is nothing wrong with taking a long time to clean up garbage
tuples in heap pages that have very little garbage total. In fact I
think that it might be a good thing. Why be in a hurry to dirty the
page again? If it becomes a real problem in the short term then the
bottom-up stuff can take care of it. Under this old-but-new paradigm,
maybe VACUUM has to visit indexes a lot less (maybe it just decides
not to sometimes, based on stats about the indexes, like we see today
with vacuum_index_cleanup = off). VACUUM is for making infrequent
"clean sweeps", though it typically leaves most of the work to new
bottom-up approaches, which are well placed to understand the needs of
queries that touch nearby data. Autovacuum does not presume to
understand the needs of queries at all.

It would also be possible for VACUUM to make more regular sweeps over
the version store without disturbing the main relation under this
model. The version store isn't that different from a separate heap
relation, but naturally doesn't require traditional index cleanup or
freezing, and naturally stores things in approximately temporal order.
So we recycle space in the version store in a relatively eager,
circular fashion, because it naturally contains data that favors such
an approach. We make up for the fixed cost of using the separate old
version store structure by reducing deferred costs like this, and by
only using it when it is demonstrably helpful.
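
A standalone sketch of that recycling pattern, with invented
structures and completely ignoring xid wraparound:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define OVS_NSEGMENTS 8

typedef struct OvsSegment
{
    uint32_t    newest_xmax;    /* newest deleting xact recorded here */
    bool        in_use;
} OvsSegment;

typedef struct OldVersionStore
{
    OvsSegment  segs[OVS_NSEGMENTS];
    int         head;           /* segment currently being appended to */
    int         tail;           /* oldest segment not yet recycled */
} OldVersionStore;

/*
 * Recycle whole segments from the tail while everything in the oldest
 * segment is older than anything any snapshot can still see.  (Ignores
 * xid wraparound, which a real version of this could not do.)
 */
static void
ovs_recycle(OldVersionStore *ovs, uint32_t oldest_visible_xid)
{
    while (ovs->tail != ovs->head &&
           ovs->segs[ovs->tail].newest_xmax < oldest_visible_xid)
    {
        ovs->segs[ovs->tail].in_use = false;    /* whole-segment truncation */
        ovs->tail = (ovs->tail + 1) % OVS_NSEGMENTS;
    }
}

int
main(void)
{
    OldVersionStore ovs = {
        .segs = {{100, true}, {200, true}, {300, true}},
        .head = 2,
        .tail = 0,
    };

    ovs_recycle(&ovs, 250);
    printf("oldest live segment: %d\n", ovs.tail);  /* segments 0, 1 recycled */
    return 0;
}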

It might also make sense to prioritize putting heap tuples in the
version store when they represent versions whose indexed columns were
changed by an update (basically a non-HOT update) -- we work extra
hard on that (and just leave behind an LP_DEAD line pointer). That way
VACUUM can do retail index tuple deletion for the index whose columns
were modified when it finds them in the version store (presumably this
nbtree patch of mine works well enough with the logically unchanged
index entries for other indexes that we don't need to go out of our
way).
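
As a rough sketch of the bookkeeping that would take (all invented,
just to show the shape of it):

#include <stdint.h>

/*
 * Header for a version displaced by a non-HOT update.  The idea is to
 * record enough to let VACUUM recompute the old index keys and delete
 * the matching index tuples retail, without scanning whole indexes.
 */
typedef struct DisplacedVersionHeader
{
    uint32_t    relid;              /* table the version came from */
    uint32_t    blkno;              /* heap block it used to live on */
    uint16_t    offnum;             /* original line pointer (now LP_DEAD) */
    uint32_t    changed_index_cols; /* bitmap of indexed columns that changed */
    uint16_t    datalen;            /* length of the tuple payload that follows */
    /* tuple payload follows */
} DisplacedVersionHeader;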

I'm sure that there are more than a few holes in this sketch of mine.
It's not really worked out, but it has certain properties that are
generally underappreciated -- especially the thing about the worst
case number of versions per logical row being extremely important, as
well as the idea that back pressure can be a good thing when push
comes to shove -- provided it is experienced locally, and only by
queries that update the same very hot logical rows. Back pressure
needs to be proportionate and approximately fair. Another important
point that I want to call out again here is that we should try to
exploit cost/benefit asymmetry opportunistically. That seems to have
worked out extremely well in my recent B-Tree delete dedup patch.

I don't really expect anybody to take this seriously -- especially as
a total blueprint. Maybe some elements of what I've sketched can be
used as part of some future big effort. You've got to start somewhere.
It has to be incremental.

-- 
Peter Geoghegan


