
From Peter Geoghegan
Subject The Free Space Map: Problems and Opportunities
Msg-id CAH2-Wzk56qXpK4bSbiySauVyvZ4B4hF9Gm99T1Vtat+aomcaWQ@mail.gmail.com
List pgsql-hackers
I have suspected that there are serious design issues with the FSM
(and related heapam code like hio.c) for several years now [1]. This
has a lot to do with the experience of using Jan Wieck's BenchmarkSQL
implementation of TPC-C [2][3][4]. It clearly makes Postgres exhibit
pathological performance issues, especially heap fragmentation.

There is a clear way in which free space in the two largest tables
(orders and order lines) ought to be managed, given the fixed pattern
of the workload, but that isn't what we see. I had the opportunity to
work with Jan Wieck and Greg Smith directly on this, shortly before I
left Crunchy Data several weeks ago. Jan was able to fix some issues
on the BenchmarkSQL side, which helped a lot. But the real problems
remain. It is generally understood that the FSM leads to Postgres
doing badly with TPC-C. I personally believe that this is
representative of real practical problems, and not just a problem with
this one benchmark.

This email is an attempt to impose some order on a disorderly problem
space. I'd like to compare notes with other people who are interested
in the problem. I suspect that some experienced hackers have yet to be
convinced of the importance of the FSM when it comes to bloat.
Hopefully I'll manage to make that case convincingly on this thread.

If any of my more concrete claims about problems in the FSM seem like
they need to be justified, please let me know. I am omitting details
in this initial overview email for the sake of brevity. It's already
too long.

Problems
--------

The FSM gives out space without considering the passage of time, or
the likelihood that a particular transaction or client will consume
its requested free space in a more or less predictable and steady way
over time. It has no memory. The FSM therefore fails to capture
naturally occurring locality, or at least doesn't specifically care
about it. This is the central problem, from which all other problems
with the FSM seem to stem in one way or another.

The FSM should group related rows (e.g. rows from the same transaction
or backend) together, so that they reliably go on the same heap page
-- careless mixing of unrelated rows should be avoided. When I say
"related", I mean related in whatever sense the application or end
user thinks of them as related. As a general rule we should expect
groups of rows that were inserted by the same transaction to also be
read, updated, deleted, or removed by VACUUM together, as a group.
While this isn't guaranteed, it's a good working assumption for the
FSM. It avoids unnecessary fragmentation. The current FSM design
causes significant fragmentation with workloads where users *expect*
no fragmentation at all. I see this with TPC-C. The way that the FSM
*ought* to behave for the workload in question is intuitively obvious,
once you lay it all out. But that's not what we see in practice -- far
from it. (Happy to go into more detail on this.)

You don't even need temporal locality to see problems. Even random
inserts show up certain FSM implementation problems. The FSM lacks
even a basic understanding of the *aggregate* result of backends
applying various FSM heuristics over time, and with concurrent access.
Inserting backends currently behave like an amateur soccer team where
every individual soccer player chases the ball, without any
coordination among team members. The players in this analogy somehow
fail to consider where the ball *is about to be* -- there is no
temporal awareness. They also miss the importance of cooperating with
each other as a team -- there is also no spatial awareness, and no
thought given to second order effects. This leads to increased buffer
lock contention and fragmentation.

(Other known FSM problems have been omitted for the sake of brevity.)

Background
----------

To recap, the FSM tracks how much free space is in every heap page.
Most FSM maintenance takes place during VACUUM, though ordinary
connection backends occasionally help to keep the FSM current, via
calls to RecordAndGetPageWithFreeSpace() made from hio.c.
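
To make that concrete, here's a heavily abridged sketch of the retry
loop that hio.c runs for an inserting backend. GetPageWithFreeSpace()
and RecordAndGetPageWithFreeSpace() are the real freespace.c entry
points; everything else about the real RelationGetBufferForTuple() has
been elided:

/* Abridged sketch of hio.c's use of the FSM -- not the real code */
BlockNumber targetBlock = GetPageWithFreeSpace(relation, len);

while (targetBlock != InvalidBlockNumber)
{
    Buffer      buffer = ReadBuffer(relation, targetBlock);
    Size        pageFreeSpace;

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    pageFreeSpace = PageGetHeapFreeSpace(BufferGetPage(buffer));
    if (len <= pageFreeSpace)
        break;              /* success: caller inserts onto this page */

    /*
     * Stale FSM info: record the page's true free space and get
     * another candidate page, in a single FSM call.
     */
    UnlockReleaseBuffer(buffer);
    targetBlock = RecordAndGetPageWithFreeSpace(relation, targetBlock,
                                                pageFreeSpace, len);
}

if (targetBlock == InvalidBlockNumber)
{
    /* FSM has nothing usable: extend the relation (maybe in bulk) */
}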

There is also a bulk extension mechanism added by commit 719c84c1be,
which is used when we detect multiple concurrent inserters. This bulk
extension mechanism does change things a little (it gives some thought
to systematic effects), though it still has the same basic issues: it
doesn't do nearly enough to group likely-related rows together. I
suspect that improving the bulk extension mechanism will be an
important part of improving the overall picture. That mechanism needs
to be more explicit, and more careful about who gets what free space
when. I'll go into the significance of this mechanism to the FSM
below, under "Opportunities". But first, more background information.

Papers
------

I've taken time to review the database research literature, which
doesn't have all that much to say here. But there are a couple of
relevant classic papers that I know of [6][7].

A lot of the considerations for free space management in heap-based
systems like DB2 and Oracle are naturally intertwined with concurrency
control, recovery, and even heavyweight locking. These systems must
treat TIDs as stable identifiers of a logical row, and not identifiers
of a physical tuple -- another important difference from Postgres.
"Free space" is
only truly available to reuse in these systems some time after a
deleter commits and releases its heavyweight locks. This is why their
FSM equivalents must be WAL-logged. There is even a need for logical
UNDO to cover these FSM-equivalent data structures, which have their
own tricky rollback path issues that align with those in the heap
structure. Everything is tied together to avoid rollback safety
hazards. Physical rollback must never discover that a tuple to be
restored from UNDO no longer physically fits on the original heap
page. Plus the line pointer can't just be recycled. It's all quite
delicate.

At first I thought that this UNDO rollback safety business meant that
I had nothing to learn from these old papers. Postgres naturally
doesn't have to care about a transaction reusing space before another
transaction that deleted the space commits (or after an inserter
aborts) -- in Postgres, space is just that: space. We always have the
*option* to create a new HOT chain on another page. However, I
eventually changed my mind -- these old papers are relevant. In a way
Postgres actually does have rollback. It doesn't involve delicate
physical handling, in large part because TID stability isn't
guaranteed in Postgres. But Postgres style "logical rollback" is more
similar than it is different.

Logical database
----------------

While it certainly is good that Postgres has the freedom to allow the
physical representation of the logical database to diverge in all
kinds of ways, that doesn't mean that we should be totally unconcerned
about just how far the divergence goes. It's good that we have the
option to retain a few extra versions without considering rollback
hazards. But an option isn't an obligation. That's what makes an
option useful -- it gives us the ability to have our cake, and eat it
too.

Clearly we should try to avoid the need for a new HOT chain on another
heap block, for example. OTOH we shouldn't bend over backwards to make
sure of it -- that can't be worth it, since it isn't a matter of
fundamental correctness with Postgres style concurrency control. As I
said, we should try to be a little bit like ARIES with full UNDO
rollback, purely to avoid bloat and improve temporal locality --
concurrent inserters should not clobber the same heap pages. I believe
that something a little like FSM transaction rollback is just what we
need here. A design that is inspired by other existing designs,
including even the UNDO + rollback parts, which are pretty essential.
Transaction boundaries and ownership semantics for free space really
do seem to matter when it comes to avoiding bloat.

Roughly the same high level approach led to my developing bottom-up
index deletion for Postgres 14 [5] -- the same "logical vs physical"
tension exists in Postgres B-Tree indexes. Using "logical database
surrogates" in the physical database is actually a standard family of
techniques used in DB systems design. This family of techniques has
been underexploited within Postgres, probably because it isn't so
natural in a world without UNDO style rollback. This family of
techniques is something to keep in your "bag of tricks" as a Postgres
hacker (as Hellerstein once put it).

I'll return to how we might actually do something a bit like UNDO
style rollback in the FSM later on. It is counterintuitive, but stay
with me.

Open and closed pages
---------------------

Note that other designs for FSM style structures (for a heap table
access method) all seem to have fewer than 10 possible "page has this
much free space" increments [6] -- the granularity is far coarser than
our own FSM_CATEGORIES granularity (which has a massive 255 distinct
increments of free space for each heap page). One reason for this is
that ignoring tiny differences avoids contention and fragmentation
from chasing insignificant differences between heap blocks, which I
don't need to go into again (my earlier amateur soccer team analogy is
enough for now). But it's also related to the logical
database/rollback stuff that I just went into. It sometimes makes
sense to think of a physical heap page as more than just a data
structure. Even things that we tend to naturally think of as strictly
physical may need to be rethought just a little.
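
For context: fsm_space_avail_to_cat() in freespace.c currently maps
free bytes into those fine-grained categories (roughly avail / 32
bytes with 8KB blocks). A coarse mapping in the spirit of those other
designs might look more like the following hypothetical sketch, with
made-up thresholds:

/* Hypothetical coarse replacement for fsm_space_avail_to_cat() */
static uint8
coarse_space_cat(Size avail)
{
    if (avail < BLCKSZ / 8)
        return 0;               /* effectively full ("closed") */
    if (avail < BLCKSZ / 4)
        return 1;
    if (avail < BLCKSZ / 2)
        return 2;
    return 3;                   /* mostly empty ("open") */
}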

Take DB2's version of the FSM, FSIP [7]. This design usually never
inserts *any* new logical rows on a heap page after the page first
fills -- the page is "open" when first allocated, and then quickly
becomes "closed" to inserts of new logical rows once its remaining
free space falls below a certain threshold. The heap page usually
stays "closed" forever. While the design does not *completely* rule
out the possibility that the page will eventually become empty enough
for reuse to really look like a good idea again, it makes that rare. A
heap page needs to have rather a lot of its original logical rows
deleted before that becomes possible again. It's rare for a page to go
backwards, because the threshold for making a closed page open again
is far more demanding (i.e. requires much more free space) than the
threshold that initially made the newly allocated heap page closed.
This one-way stickiness quality is important. As with simulated
annealing algorithms, heap pages naturally settle over time; the
algorithm makes it hard to change things once they have settled -- it
has to be really worth it before we'll flip a page back to "open"
again. There is a
general bias against disturbing that which has become the settled
state. (I think that the terms "open" and "closed" come from the world
of online bin packing problems, where the same tension and
uncertainties exist.)

This stickiness concept is called "hysteresis" by some DB researchers,
often when discussing UNDO stuff [8]. Having *far less* granularity
than FSM_CATEGORIES/255 seems essential to make that work as intended.
Pages need to be able to settle without being disturbed by noise-level
differences. That's all that super fine grained values buy you: more
noise, more confusion.
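
Just to make the hysteresis idea concrete, here is a hypothetical
sketch (none of these names or thresholds exist anywhere today). The
important property is the asymmetry: reopening a page demands far more
free space than the page had at the point it was closed:

#define CLOSE_THRESHOLD     (BLCKSZ / 10)   /* close once nearly full */
#define REOPEN_THRESHOLD    (BLCKSZ / 2)    /* reopen when half empty */

typedef enum { PAGE_OPEN, PAGE_CLOSED } PageInsertState;

static PageInsertState
next_insert_state(PageInsertState cur, Size freeSpace)
{
    if (cur == PAGE_OPEN && freeSpace < CLOSE_THRESHOLD)
        return PAGE_CLOSED;     /* page has settled */
    if (cur == PAGE_CLOSED && freeSpace > REOPEN_THRESHOLD)
        return PAGE_OPEN;       /* enough deletes to disturb it */
    return cur;                 /* sticky: noise changes nothing */
}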

Visibility map
--------------

If the logical database and natural locality are important to the FSM,
then what about the visibility map? And what about the relationship
between the FSM and the visibility map, in light of all this?

Currently VACUUM doesn't care about how its FSM behavior interacts
with how it sets all-frozen/all-visible bits for the same heap page.
To me this seems completely unreasonable -- they're *obviously*
related! By ignoring the VM/FSM relationship, we probably only gain a
minimal amount of free space on one occasion, while paying a high cost
for it. Worst of all, we're *perpetuating the cycle* of dirtying and
redirtying the same pages over time. Maybe we should go as far as
merging the FSM and VM, even -- that seems like a natural consequence
of logical-ish/qualitative definitions of "page is full".

Opportunities
-------------

Now more about what I think a new FSM design for Postgres *should* do.
This is almost a description of an actual data structure. A new FSM.
But not quite.

As we've seen, free space management in a true ARIES-style system
(like the DB2 FSIP design) is forced to think of free space in terms
of "leases", or bulk-allocated free lists of usually-contiguous pages.
Under a true ARIES design, it is strictly necessary to track which
deleter XID freed up which individual lease of free space, lest some
concurrent inserter reuse the space before it is truly safe to do so
-- the deleter xact must commit before that reuse can be safely
allowed. While delete + commit does seem mostly irrelevant to the
Postgres FSM, insert + abort case handling seems to be something that
we can learn from. Maybe the same principles can be applied in
Postgres, where we need leases to avoid leaks in the presence of
errors (not necessarily rollback per se). We must be able to tolerate
incorrect speculations about where free space will be needed next.
Transactions must be given the opportunity to make use of the heap
free space lease that they asked for -- but not forever. It's a
balancing act.

Imagine if the FSM tracked recent free space requests that have been
satisfied already. Requesting backends will maintain this same
metadata when they burn through their bulk allocation of reserved
space/heap pages. This new structure would contain metadata like the
requesting XID/backend for a given lease of heap pages, the number of
leases during the last bulk extension operation, and maybe the number
of concurrently waiting backends at that time. Now we have some sense
of recent history, and start to recognize trends over time. The FSM is
now able to *take back* some of the space that it gives out, based on
new information. Now the FSM can afford to make mistakes because the
mistakes can always be corrected a little later. The FSM can be
generous based on speculation, heuristics, whatever it might be. It
should be possible for the FSM to have it both ways, more or less.
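
A hypothetical sketch of the kind of per-lease metadata I have in mind
(every name here is invented -- nothing like this struct exists
today):

typedef struct FreeSpaceLease
{
    TransactionId   owner_xid;      /* xact the lease was granted to */
    BackendId       owner_backend;  /* requesting backend */
    BlockNumber     start_block;    /* first page of contiguous range */
    int             npages;         /* pages in the lease */
    int             nwaiters;       /* waiters at bulk-extend time */
    TimestampTz     granted_at;     /* so stale leases can be revoked */
} FreeSpaceLease;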

Consider the bulk extension mechanism for concurrent inserters that
was added by commit 719c84c1be once again.

The mechanism's leader backend process performs the actual bulk heap
relation extension, allocating up to 512 new heap pages all at once.
Even when the leader backend ends up extending a heap relation by
quite a few pages (hundreds), the leader is still unable to *directly*
hand off free space to particular backends that drove the leader to
extend aggressively in the first place. It would be quite natural for
the leader backend to say to each individual follower backend: thank
you for waiting for me, here is a big contiguous range of heap pages
(e.g., as many as 50 heap pages or more per xact when things truly
ramp up), which should be enough to allow you to avoid going to the
FSM again for quite a long time.

But the leader backend doesn't do that at all -- it just puts all of
the pages (up to 512 total) in the FSM. All of the waiting follower
backends are left to senselessly fight it out for free space by simply
going back to the FSM when they wake up again. Why can't they
cooperate intelligently? What stops that today?
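
Here's a hypothetical sketch of what the leader could do instead.
grant_lease() is invented for illustration; RecordPageWithFreeSpace()
is the real freespace.c function that the existing code calls for
every single new page:

/* Hypothetical: carve a bulk extension into per-backend leases */
static void
distribute_new_pages(Relation rel, BlockNumber firstBlock,
                     int npages, int nwaiters)
{
    int         perBackend = npages / (nwaiters + 1);   /* + leader */
    BlockNumber nextBlock = firstBlock;

    for (int i = 0; i <= nwaiters; i++)
    {
        /* Hand one contiguous range directly to each waiter */
        grant_lease(rel, i, nextBlock, perBackend);
        nextBlock += perBackend;
    }

    /* Any remainder becomes generally available, as all pages do now */
    for (; nextBlock < firstBlock + npages; nextBlock++)
        RecordPageWithFreeSpace(rel, nextBlock,
                                BLCKSZ - SizeOfPageHeaderData);
}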

FSM "rollback"
--------------

(This is where my argument comes together, finally.)

FSM rollback in the FSIP paper is really just about reliable ownership
semantics: ownership of free space in heap pages, scoped at the level
of transactions.

It seems to me that ownership of free space in heap pages by
particular transactions is really all that the bulk extension FSM
logic is missing -- and getting that right seems like a big part of
fixing the FSM comprehensively. Today the bulk extension mechanism
cannot *trust* any one backend with more than a little free space at a
time.
While on average each backend's use of free space probably *is*
predictable over time, it only takes one or two straggler backends per
bulk operation to mess everything up -- that's enough for the system
to end up with very poor space utilization. Since we don't have any
sense of ownership of space in heap pages by transactions today, the
code must hedge against stragglers/space leaks by making the space
*generally* available to all. Of course, this hedging comes at a high
cost: it prevents the whole system (heapam, FSM, and backends) from
capturing naturally occurring locality in a variety of important
cases. Including the important TPC-C case I mentioned.

Explicit scoped ownership of free space in heap pages removes the need
to hedge. It addresses the problem of would-be space leaks directly,
while *also* fixing the FSM locality problems implicitly. Metadata
about recent requests gives us the *context* needed to do all this
well. Sure, we don't need it for transaction rollback using ARIES
style UNDO, but we can still use it for this. Perhaps this can be
pushed much further. Long running transactions ought to have as much
time as they need to use the large lease of free space that they were
provided. But when the transaction commits or aborts, we might then
make the FSM code much more aggressive about stealing space back from
the backend that inherits the lease from the now-committed
transaction. It all depends.
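
For example, a hypothetical end-of-transaction hook (all names below
are invented) might expire a transaction's leases, returning unused
pages to general availability:

/* Hypothetical sketch: take back unused lease space at end of xact */
static void
AtEOXact_FreeSpaceLeases(bool isCommit)
{
    ListCell   *lc;

    foreach(lc, MyBackendLeases)    /* invented per-backend list */
    {
        FreeSpaceLease *lease = (FreeSpaceLease *) lfirst(lc);

        /*
         * On abort, unused pages go back to the FSM immediately.  On
         * commit, the FSM becomes free to steal the remainder later,
         * e.g. when another backend is starved for space.
         */
        if (!isCommit)
            return_unused_pages(lease);
        else
            mark_lease_stealable(lease);
    }
}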

Maybe we can teach VACUUM to eagerly clean up aborted transactions
that are known to have consumed a lot of free space that turns out
not to be needed for the inserted rows. It might also be feasible to make
our new slimmed-down FSM crash-safe. The way that REDO routines for
VACUUM handle FSM maintenance is pretty sloppy right now.

Bean counting
-------------

In general, the idea that an algorithm can always make better
decisions with more information seems dubious to me. Accounting for
every tiny
little fragment of free space ("bean counting") is definitely not an
intrinsic good. Maybe keeping that information is not just wasteful --
maybe it's actually harmful. There is a real risk of overinterpreting
noise-level differences [9]. I'd prefer to err in the direction of
underfitting.

I now wonder if the FSM is fundamentally doing the wrong thing by
keeping track of all "free space" in every page over time. Why
wouldn't we just have a free space management strategy that is
generally only concerned with recent events? If we find a way to make
almost all newly allocated heap pages become "closed" quickly (maybe
they're marked all-frozen quickly too), and make sure that that
condition is sticky, then this can work out well. We may not even need
to store explicit freespace information for most heap pages in the
database -- a page being "closed" can be made implicit by the FSM
and/or VM. Making heap pages age-out like this (and mostly stay that
way over time) has obvious benefits in many different areas.

This won't help workloads like stock pgbench, which is pretty
unrealistic anyway. But it needn't hurt them either.

[1] https://postgr.es/m/CAH2-WzkEP6wz2Eg7mgKbF-qTPi21+BWrJgNm0qfU5kf0iJBV2g@mail.gmail.com
[2] https://postgr.es/m/0265f9e2-3e32-e67d-f106-8abde596c0e4@commandprompt.com
[3] https://github.com/wieck/benchmarksql/commit/6ac326ad23d55de2edc18bfbffb10b21f1b39b48
[4] https://github.com/wieck/benchmarksql/commit/0830c5f526061529eb2b45831c544539caa65435
[5] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf
[6] https://dl.acm.org/doi/abs/10.1145/235968.233355
[7] https://www.semanticscholar.org/paper/Algorithms-for-Flexible-Space-Management-in-Systems-Mohan-Haderle/9db1a8104daade31949b6399cac9169751fa1605
[8] https://queue.acm.org/detail.cfm?id=3099561
[9] https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff#Approaches
--
Peter Geoghegan


