Thread: The Free Space Map: Problems and Opportunities

The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
I have suspected that there are serious design issues with the FSM
(and related heapam code like hio.c) for several years now [1]. This
has a lot to do with the experience of using Jan Wieck's BenchmarkSQL
implementation of TPC-C [2][3][4]. It clearly makes Postgres exhibit
pathological performance issues, especially heap fragmentation.

There is a clear way in which free space in the two largest tables
(orders and order lines) ought to be managed, given the fixed pattern
of the workload, but that isn't what we see. I had the opportunity to
work with Jan Wieck and Greg Smith directly on this, shortly before I
left Crunchy Data several weeks ago. Jan was able to fix some issues
on the BenchmarkSQL side, which helped a lot. But the real problems
remain. It is generally understood that the FSM leads to Postgres
doing badly with TPC-C. I personally believe that this is
representative of real practical problems, and not just a problem with
this one benchmark.

This email is an attempt to impose some order on a disorderly problem
space. I'd like to compare notes with other people that are interested
in the problem. I suspect that some experienced hackers have yet to be
convinced of the importance of the FSM when it comes to bloat.
Hopefully I'll manage to make that case convincingly on this thread.

If any of my more concrete claims about problems in the FSM seem like
they need to be justified, please let me know. I am omitting details
in this initial overview email for the sake of brevity. It's already
too long.

Problems
--------

The FSM gives out space without considering the passage of time, or
the likelihood that a particular transaction or client will consume
its requested free space in a more or less predictable and steady way
over time. It has no memory. The FSM therefore fails to capture
naturally occurring locality, or at least doesn't specifically care
about it. This is the central problem, that all other problems with
the FSM seem to stem from in one way or another.

The FSM should group related rows (e.g. rows from the same transaction
or backend) together, so that they reliably go on the same heap page
-- careless mixing of unrelated rows should be avoided. When I say
"related", I mean related in whatever sense the application or end
user thinks of them as related. As a general rule we should expect
groups of rows that were inserted by the same transaction to also be
read, updated, deleted, or removed by VACUUM together, as a group.
While this isn't guaranteed, it's a good working assumption for the
FSM. It avoids unnecessary fragmentation. The current FSM design
causes significant fragmentation with workloads where users *expect*
no fragmentation at all. I see this with TPC-C. The way that the FSM
*ought* to behave for the workload in question is intuitively obvious,
once you lay it all out. But that's not what we see in practice -- far
from it. (Happy to go into more detail on this.)

You don't even need temporal locality to see problems. Even random
inserts show up certain FSM implementation problems. The FSM lacks
even a basic understanding of the *aggregate* result of backends
applying various FSM heuristics over time, and with concurrent access.
Inserting backends currently behave like an amateur soccer team where
every individual soccer player chases the ball, without any
coordination among team members. The players in this analogy somehow
fail to consider where the ball *is about to be* -- there is no
temporal awareness. They also miss the importance of cooperating with
each other as a team -- there is also no spatial awareness, and no
thought given to second order effects. This leads to increased buffer
lock contention and fragmentation.

(Other known FSM problems have been omitted for the sake of brevity.)

Background
----------

To recap, the FSM tracks how much free space is in every heap page.
Most FSM maintenance takes place during VACUUM, though ordinary
connection backends occasionally help to keep the FSM current, via
calls to RecordAndGetPageWithFreeSpace() made from hio.c.
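
To make that control flow concrete, here is a heavily simplified sketch
of how an inserting backend consults the FSM today. It is loosely based
on RelationGetBufferForTuple() in hio.c, with buffer locking, the
relcache target block, and bulk extension all omitted -- a sketch, not
the real code:

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "storage/freespace.h"
    #include "utils/rel.h"

    /* Simplified sketch only -- not the real hio.c control flow */
    static BlockNumber
    sketch_find_insert_block(Relation rel, Size len)
    {
        /* Ask the FSM for any page that claims to have enough space */
        BlockNumber targetBlock = GetPageWithFreeSpace(rel, len);

        while (targetBlock != InvalidBlockNumber)
        {
            Buffer  buf = ReadBuffer(rel, targetBlock);
            Size    freeSpace = PageGetHeapFreeSpace(BufferGetPage(buf));

            if (len <= freeSpace)
            {
                ReleaseBuffer(buf);
                return targetBlock;     /* good enough -- use this page */
            }

            /*
             * The FSM entry was stale: record the true amount of free
             * space and ask for another candidate page in one call.
             */
            ReleaseBuffer(buf);
            targetBlock = RecordAndGetPageWithFreeSpace(rel, targetBlock,
                                                        freeSpace, len);
        }

        /* Nothing suitable in the FSM: the caller extends the relation */
        return InvalidBlockNumber;
    }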

There is also a bulk extension mechanism added by commit 719c84c1be,
which is used when we detect multiple concurrent inserters. This bulk
extension mechanism does change things a little (it gives some thought
to systematic effects), though it still has the same basic issues: it
doesn't do nearly enough to group likely-related rows together. I
suspect that improving the bulk extension mechanism will be an
important part of improving the overall picture. That mechanism needs
to be more explicit, and more careful about who gets what free space
when. I'll go into the significance of this mechanism to the FSM
below, under "Opportunities". But first, more background information.

Papers
------

I've taken time to review the database research literature, which
doesn't have all that much to say here. But there are a couple of
relevant classic papers that I know of [6][7].

A lot of the considerations for free space management in heap-based
systems like DB2 and Oracle are naturally intertwined with concurrency
control, recovery, and even heavyweight locking. These systems must
treat TIDs as stable identifiers of a logical row, and not identifiers
of a physical tuple -- another important difference. "Free space" is
only truly available to reuse in these systems some time after a
deleter commits and releases its heavyweight locks. This is why their
FSM equivalents must be WAL-logged. There is even a need for logical
UNDO to cover these FSM-equivalent data structures, which have their
own tricky rollback path issues that align with those in the heap
structure. Everything is tied together to avoid rollback safety
hazards. Physical rollback must never find that the tuple it needs to
restore from UNDO no longer physically fits on the original heap page.
Plus the line pointer can't just be recycled. It's all quite
delicate.

At first I thought that this UNDO rollback safety business meant that
I had nothing to learn from these old papers. Postgres naturally
doesn't have to care about a transaction reusing space before another
transaction that deleted the space commits (or after an inserter
aborts) -- in Postgres, space is just that: space. We always have the
*option* to create a new HOT chain on another page. However, I
eventually changed my mind -- these old papers are relevant. In a way
Postgres actually does have rollback. It doesn't involve delicate
physical handling, in large part because TID stability isn't
guaranteed in Postgres. But Postgres style "logical rollback" is more
similar than it is different.

Logical database
----------------

While it certainly is good that Postgres has the freedom to allow the
physical representation of the logical database to diverge in all
kinds of ways, that doesn't mean that we should be totally unconcerned
about just how far the divergence goes. It's good that we have the
option to retain a few extra versions without considering rollback
hazards. But an option isn't an obligation. That's what makes an
option useful -- it gives us the ability to have our cake, and eat it
too.

Clearly we should try to avoid the need for a new HOT chain on another
heap block, for example. OTOH we shouldn't bend over backwards to make
sure of it -- that can't be worth it, since it isn't a matter of
fundamental correctness with Postgres style concurrency control. As I
said, we should try to be a little bit like ARIES with full UNDO
rollback, purely to avoid bloat and improve temporal locality --
concurrent inserters should not clobber the same heap pages. I believe
that something a little like FSM transaction rollback is just what we
need here. A design that is inspired by other existing designs,
including even the UNDO + rollback parts, which are pretty essential.
Transaction boundaries and ownership semantics for free space really
do seem to matter when it comes to avoiding bloat.

Roughly the same high level approach led to my developing bottom-up
index deletion for Postgres 14 [5] -- the same "logical vs physical"
tension exists in Postgres B-Tree indexes. Using "logical database
surrogates" in the physical database is actually a standard family of
techniques used in DB systems design. This family of techniques has
been underexploited within Postgres, probably because it isn't so
natural in a world without UNDO style rollback. This family of
techniques is something to keep in your "bag of tricks" as a Postgres
hacker (as Hellerstein once put it).

I'll return to how we might actually do something a bit like UNDO
style rollback in the FSM later on. It is counterintuitive, but stay
with me.

Open and closed pages
---------------------

Note that other designs for FSM style structures (for a heap table
access method) all seem to have fewer than 10 possible "page has this
much free space" increments [6] -- the granularity is far coarser than
our own FSM_CATEGORIES granularity (which has a massive 255 distinct
increments of free space for each heap page). One reason for this is
that ignoring tiny differences avoids contention and fragmentation
from chasing insignificant differences between heap blocks, which I
don't need to go into again (my earlier amateur soccer team analogy is
enough for now). But it's also related to the logical
database/rollback stuff that I just went into. It sometimes makes
sense to think of a physical heap page as more than just a data
structure. Even things that we tend to naturally think of as strictly
physical may need to be rethought just a little.
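
For reference, the current code maps bytes of free space onto one of
those fine-grained categories in 32-byte steps (FSM_CAT_STEP is
BLCKSZ / FSM_CATEGORIES in fsm_internals.h). A coarse scheme along the
lines of what those papers describe might look like the sketch below;
the four buckets and their thresholds are hypothetical, purely for
illustration:

    /*
     * Hypothetical coarse free space categories, in the style of the
     * FSIP paper.  The bucket boundaries are made up for illustration.
     */
    typedef enum CoarseSpaceCat
    {
        SPC_CLOSED = 0,             /* page is "closed" to new inserts */
        SPC_QUARTER,                /* at least 25% of the page is free */
        SPC_HALF,                   /* at least 50% free */
        SPC_OPEN                    /* newly allocated / nearly empty */
    } CoarseSpaceCat;

    static CoarseSpaceCat
    coarse_space_cat(Size avail, Size blcksz)
    {
        if (avail >= blcksz - blcksz / 8)
            return SPC_OPEN;
        if (avail >= blcksz / 2)
            return SPC_HALF;
        if (avail >= blcksz / 4)
            return SPC_QUARTER;
        return SPC_CLOSED;
    }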

Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
ever end up inserting *any* new logical rows on a heap page after the
page first fills -- it is initially "open" when first allocated, and
then quickly becomes "closed" to inserts of new logical rows, once it
crosses a certain almost-empty space threshold. The heap page usually
stays "closed" forever. While the design does not *completely* ignore
the possibility that the page won't eventually get so empty that reuse
really does look like a good idea, it makes it rare. A heap page needs
to have rather a lot of the original logical rows deleted before that
becomes possible again. It's rare for it to go backwards because the
threshold that makes a closed page become open again is much lower
(i.e. involves much more free space) than the threshold that initially
made a (newly allocated) heap page closed. This one-way stickiness
quality is important. As with simulated annealing algorithms, the
design lets heap pages naturally settle, and makes it hard to change
things once they have settled -- it has to be really
worth it before we'll flip a page back to "open" again. There is a
general bias against disturbing that which has become the settled
state. (I think that the terms "open" and "closed" come from the world
of online bin packing problems, where the same tension and
uncertainties exist.)

This stickiness concept is called "hysteresis" by some DB researchers,
often when discussing UNDO stuff [8]. Having *far less* granularity
than FSM_CATEGORIES/255 seems essential to make that work as intended.
Pages need to be able to settle without being disturbed by noise-level
differences. That's all that super fine grained values buy you: more
noise, more confusion.
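
A minimal sketch of that one-way stickiness, with hypothetical
thresholds (the real FSIP numbers differ, and Postgres has no notion of
an "open" or "closed" heap page today):

    /*
     * Hypothetical open/closed transition with hysteresis.  A page only
     * reopens once far more space is free than the amount that made it
     * close, so noise-level changes cannot flip it back and forth.
     */
    #define CLOSE_THRESHOLD(blcksz)     ((blcksz) / 10)  /* < 10% free */
    #define REOPEN_THRESHOLD(blcksz)    ((blcksz) / 2)   /* > 50% free */

    static bool
    page_is_open(bool currently_open, Size avail, Size blcksz)
    {
        if (currently_open && avail < CLOSE_THRESHOLD(blcksz))
            return false;           /* settle: the page becomes closed */
        if (!currently_open && avail > REOPEN_THRESHOLD(blcksz))
            return true;            /* enough space freed up to reopen */
        return currently_open;      /* otherwise, leave it undisturbed */
    }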

Visibility map
--------------

If the logical database and natural locality are important to the FSM,
then what about the visibility map? And what about the relationship
between the FSM and the visibility map, in light of all this?

Currently VACUUM doesn't care about how its FSM behavior interacts
with how it sets all-frozen/all-visible bits for the same heap page.
To me this seems completely unreasonable -- they're *obviously*
related! We're probably only gaining a minimal amount of free space on
one occasion by ignoring the VM/FSM relationship, for which we pay a
high cost. Worst of all we're *perpetuating the cycle* of dirtying and
redirtying the same pages over time. Maybe we should go as far as
merging the FSM and VM, even -- that seems like a natural consequence
of logical-ish/qualitative definitions of "page is full".

Opportunities
-------------

Now more about what I think a new FSM design for Postgres *should* do.
This is almost a description of an actual data structure. A new FSM.
But not quite.

As we've seen, free space management in a true ARIES-style system
(like the DB2 FSIP design) is forced to think of free space in terms
of "leases", or bulk-allocated free lists of usually-contiguous pages.
Under a true ARIES design, it is strictly necessary to track which
deleter XID freed-up which individual lease of free space lest some
concurrent inserter reuse the space before it is truly safe to do so
-- the deleter xact must commit before that reuse can be safely
allowed. While delete + commit does seem mostly irrelevant to the
Postgres FSM, insert + abort case handling seems to be something that
we can learn from. Maybe the same principles can be applied in
Postgres, where we need leases to avoid leaks in the presence of
errors (not necessarily rollback per se). We must be able to tolerate
incorrect speculations about where free space will be needed next.
Transactions must be given the opportunity to make use of the heap
free space lease that they asked for -- but not forever. It's a
balancing act.

Imagine if the FSM tracked recent free space requests that have been
satisfied already. Requesting backends will maintain this same
metadata when they burn through their bulk allocation of reserved
space/heap pages. This new structure would contain metadata like the
requesting XID/backend for a given lease of heap pages, the number of
leases during the last bulk extension operation, and maybe the number
of concurrently waiting backends at that time. Now we have some sense
of recent history, and start to recognize trends over time. The FSM is
now able to *take back* some of the space that it gives out, based on
new information. Now the FSM can afford to make mistakes because the
mistakes can always be corrected a little later. The FSM can be
generous based on speculation, heuristics, whatever it might be. It
should be possible for the FSM to have it both ways, more or less.
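
To make that slightly more concrete, the per-lease metadata I have in
mind might look something like the sketch below. This is entirely
hypothetical -- the structs, their fields, and their names are all
invented for illustration:

    /*
     * Hypothetical bookkeeping for a lease of contiguous heap pages
     * handed out during bulk extension.  Nothing like this exists today.
     */
    typedef struct FreeSpaceLease
    {
        TransactionId   owner_xid;      /* xact the lease was granted to */
        BackendId       owner_backend;  /* backend driving that xact */
        BlockNumber     first_block;    /* start of the contiguous range */
        int             nblocks;        /* number of heap pages leased */
        TimestampTz     granted_at;     /* when the lease was handed out */
    } FreeSpaceLease;

    typedef struct BulkExtendHistory
    {
        int             nleases;        /* leases in last bulk extension */
        int             nwaiters;       /* backends waiting at that time */
        FreeSpaceLease  leases[FLEXIBLE_ARRAY_MEMBER];
    } BulkExtendHistory;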

Consider the bulk extension mechanism for concurrent inserters that
was added by commit 719c84c1be once again.

The mechanism's leader backend process performs the actual bulk heap
relation extension, allocating up to 512 new heap pages all at once.
Even when the leader backend ends up extending a heap relation by
quite a few pages (hundreds), the leader is still unable to *directly*
hand off free space to particular backends that drove the leader to
extend aggressively in the first place. It would be quite natural for
the leader backend to say to each individual follower backend: thank
you for waiting for me, here is a big contiguous range of heap pages
(e.g., as many as 50 heap pages or more per xact when things truly
ramp up), which should be enough to allow you to avoid going to the
FSM again for quite a long time.

But the leader backend doesn't do that at all -- it just puts all of
the pages (up to 512 total) in the FSM. All of the waiting follower
backends are left to senselessly fight it out for free space by simply
going back to the FSM when they wake up again. Why can't they
cooperate intelligently? What stops that today?
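
Carving up the bulk extension among the waiters might be as simple as
the following sketch. It is purely hypothetical (grant_lease() is an
invented helper); today the leader just dumps every new page into the
FSM instead:

    /*
     * Hypothetical: the leader hands each waiting backend its own
     * contiguous range of the newly extended pages, instead of
     * publishing all of them in the FSM.
     */
    extern void grant_lease(int waiter_slot, BlockNumber start,
                            int nblocks);   /* assumed to exist */

    static void
    distribute_extension(BlockNumber first_new_block, int new_blocks,
                         int nwaiters)
    {
        int     per_backend = new_blocks / nwaiters;
        int     i;

        for (i = 0; i < nwaiters; i++)
        {
            BlockNumber start = first_new_block + i * per_backend;

            /* e.g. 512 new pages and 10 waiters: ~50 pages per backend */
            grant_lease(i, start, per_backend);
        }
    }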

FSM "rollback"
--------------

(This is where my argument comes together, finally.)

FSM rollback in the FSIP paper is really just about reliable ownership
semantics. Ownership of free space in heap pages that is scoped at the
level of transactions.

It seems to me that ownership of free space in heap pages by
particular transactions is really all that the bulk extension FSM
logic is missing -- and getting that right seems like a big part of
fixing the FSM comprehensively. The bulk extension mechanism cannot
*trust* the backends to use more than a little space at a time today.
While on average each backend's use of free space probably *is*
predictable over time, it only takes one or two straggler backends per
bulk operation to mess everything up -- that's enough for the system
to end up with very poor space utilization. Since we don't have any
sense of ownership of space in heap pages by transactions today, the
code must hedge against stragglers/space leaks by making the space
*generally* available to all. Of course, this hedging comes at a high
cost: it prevents the whole system (heapam, FSM, and backends) from
capturing naturally occurring locality in a variety of important
cases. Including the important TPC-C case I mentioned.

Explicit scoped ownership of free space in heap pages removes the need
to hedge. It addresses the problem of would-be space leaks directly,
while *also* fixing the FSM locality problems implicitly. Metadata
about recent requests gives us the *context* needed to do all this
well. Sure, we don't need it for transaction rollback using ARIES
style UNDO, but we can still use it for this. Perhaps this can be
pushed much further. Long running transactions ought to have as much
time as they need to use the large lease of free space that they were
provided. But when the transaction commits or aborts, we might then
make the FSM code much more aggressive about stealing space back from
the backend that inherits the lease from the now-committed
transaction. It all depends.

Maybe we can teach VACUUM to eagerly clean up aborted transactions
that are known to have consumed a lot of free space that turns out to
not be needed for the inserted rows. It might also be feasible to make
our new slimmed-down FSM crash-safe. The way that REDO routines for
VACUUM handle FSM maintenance is pretty sloppy right now.

Bean counting
-------------

In general I find the idea that an algorithm can always make better
decisions with more information dubious. Accounting for every tiny
little fragment of free space ("bean counting") is definitely not an
intrinsic good. Maybe keeping that information is not just wasteful --
maybe it's actually harmful. There is a real risk of overinterpreting
noise-level difference [9]. I'd prefer to err in the direction of
underfitting.

I now wonder if the FSM is fundamentally doing the wrong thing by
keeping track of all "free space" in every page over time. Why
wouldn't we just have a free space management strategy that is
generally only concerned with recent events? If we find a way to make
almost all newly allocated heap pages become "closed" quickly (maybe
they're marked all-frozen quickly too), and make sure that that
condition is sticky, then this can work out well. We may not even need
to store explicit freespace information for most heap pages in the
database -- a page being "closed" can be made implicit by the FSM
and/or VM. Making heap pages age-out like this (and mostly stay that
way over time) has obvious benefits in many different areas.

This won't help workloads like stock pgbench, which is pretty
unrealistic anyway. But it needn't hurt them either.

[1] https://postgr.es/m/CAH2-WzkEP6wz2Eg7mgKbF-qTPi21+BWrJgNm0qfU5kf0iJBV2g@mail.gmail.com
[2] https://postgr.es/m/0265f9e2-3e32-e67d-f106-8abde596c0e4@commandprompt.com
[3] https://github.com/wieck/benchmarksql/commit/6ac326ad23d55de2edc18bfbffb10b21f1b39b48
[4] https://github.com/wieck/benchmarksql/commit/0830c5f526061529eb2b45831c544539caa65435
[5] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf
[6] https://dl.acm.org/doi/abs/10.1145/235968.233355
[7]
https://www.semanticscholar.org/paper/Algorithms-for-Flexible-Space-Management-in-Systems-Mohan-Haderle/9db1a8104daade31949b6399cac9169751fa1605
[8] https://queue.acm.org/detail.cfm?id=3099561
[9] https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff#Approaches
--
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Bruce Momjian
Date:
On Mon, Aug 16, 2021 at 10:35:45AM -0700, Peter Geoghegan wrote:
> I have suspected that there are serious design issues with the FSM
> (and related heapam code like hio.c) for several years now [1]. This
> has a lot to do with the experience of using Jan Wieck's BenchmarkSQL
> implementation of TPC-C [2][3][4]. It clearly makes Postgres exhibit
> pathological performance issues, especially heap fragmentation.

OK, here is my feedback.  First, I understand the space
reservation/soccer idea, but I am also concerned that adding space
reservation could improve some things and make others worse.

Second, let's look at a concrete example, and see how it can be improved
more simply.  As I understand it, there are three operations that need
free space:

    1.  INSERT/COPY
    2.  UPDATE where there is insufficient space on the page of the
        old row
    3.  UPDATE where there is sufficient page space

The third option already can use the existing page for a HOT update, so
the FSM doesn't matter.   For 1 & 2, suppose you have a table with 10 8k
pages, all 80% full.  As I understand it, the first page will go from
80% to 81%, then the next row addition will take the second page from
80% to 81%, until all pages are 81%, and then it starts over again.  Is
that accurate?  Is that the non-temporal memory issue you are concerned
about?    Doesn't spreading the new rows out increase the ability to do
HOT updates?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Mon, Aug 16, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote:
> OK, here is my feedback.  First, I understand the space
> reservation/soccer idea, but I am also concerned that adding space
> reservation could improve some things and make others worse.

That is definitely a legitimate concern.

There is a trade-off here: if we do too much preallocation, there may
be cases where the preallocated space that we expected to be used
quickly is never used by anybody. But that doesn't seem like a
problem, provided we don't lose track of the fact that it happened.
Then the worst that can happen is that we have empty pages that nobody
will ever use, because nobody inserts into the table ever again (not
the backend with the leaked space lease, not anybody else). That seems
manageable. We can add some kind of ramp-up behavior that gets more
and more aggressive as demand for new space from backends is steady or
increasing.

> Second, let's look at a concrete example, and see how it can be improved
> more simply.  As I understand it, there are three operations that need
> free space:
>
>         1.  INSERT/COPY
>         2.  UPDATE where there is insufficient space on the page of the
>             old row
>         3.  UPDATE where there is sufficient page space

Right.

> The third option already can use the existing page for a HOT update, so
> the FSM doesn't matter.

I agree.

All good questions. Thank you for diving in on this.

> For 1 & 2, suppose you have a table with 10 8k
> pages, all 80% full.  As I understand it, the first page will go from
> 80% to 81%, then the next row addition will take the second page from
> 80% to 81%, until all pages are 81%, and then it starts over again.  Is
> that accurate?

No. Generally backends have their own target block (a simple
BlockNumber) that they store in the relcache, one per table recently
modified. Backends/heapam uses RelationGetTargetBlock() to access this
local cache. So there is "backend affinity" for particular individual
heap pages, even today. However, that still has many problems.
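
Roughly speaking, the existing per-backend affinity amounts to the
sketch below, using the real RelationGetTargetBlock() and
RelationSetTargetBlock() macros from rel.h (buffer locking, retries,
and relation extension are all omitted):

    /* Simplified sketch of the existing "target block" affinity */
    static BlockNumber
    sketch_target_block(Relation relation, Size len)
    {
        /* Purely local cache of the block last used for this relation */
        BlockNumber targetBlock = RelationGetTargetBlock(relation);

        if (targetBlock == InvalidBlockNumber)
        {
            /* No cached target block: fall back on the shared FSM */
            targetBlock = GetPageWithFreeSpace(relation, len);
        }

        /* ... the caller tries to insert on targetBlock here ... */

        /* Remember the choice for this backend's next insertion */
        RelationSetTargetBlock(relation, targetBlock);

        return targetBlock;
    }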

It doesn't make sense to have a local cache for a shared resource --
that's the problem. You actually need some kind of basic locking or
lease system, so that 10 backends don't all decide at the same time
that one particular heap block is fully empty, and therefore a good
target block for that backend alone. It's as if the backends believe
that they're special little snowflakes, and that no other backend
could possibly be thinking the same thing at the same time about the
same heap page. And so when TPC-C does its initial bulk insert,
distinct orders are already shuffled together in a hodge-podge, just
because concurrent bulk inserters all insert on the same heap pages.

If we could safely give out 10 or 50 pages directly to each backend,
then clearly the related data would be stored together in this
scenario -- each backend would be able to work through its lease of
contiguous heap pages for quite a while before revisiting the
FSM/relation extension. The trick is to teach the implementation to do
that without allowing the backend to permanently leak whole entire
leases with maybe dozens of free pages.

Systems like DB2 and Oracle probably can't have this problem. Even the
bulk lease case is just an extension of something they had to do
anyway. You see, some kind of FSM lease system that knows about
transaction lifetime for any given piece of free space is strictly
necessary with UNDO based rollback. Without that, transaction rollback
breaks in certain edge cases involving concurrent inserts into the
same page, right before a rollback that needs to put an updated
version back in place. If the updated version from undo happens to be
physically larger than the now-aborted successor version, and if there
is little or no space left to cover that, what can rollback do about
it? Nothing. It's a nasty bug. They use heavyweight locks to avoid
this.

> Is that the non-temporal memory issue you are concerned
> about?

When I talk about memory or time, what I'm usually referring to is the
ability to manage larger allocations of multiple blocks for a while
after they're originally requested. So that a requesting
transaction/backend is given the opportunity to use the space that
they asked for. They shouldn't be completely trusted. We should trust
but verify.

> Doesn't spreading the new rows out increase the ability to do
> HOT updates?

It's true that the problem scenario with TPC-C does involve updates,
and it's true that that's when things really go badly. But I
deliberately haven't gone into the role that the updates play there
too much. Not just yet.

It's important to note that the problems really do start when we
insert, even if that's not obvious -- that's when the locality is
lost, right when the original order transaction comes in. We lose
locality just because we have concurrent inserters into the same
table, where each transaction inserts multiple related rows. We fail
to group related rows together right from the start, setting us up for
failure later on.

This is actually quite simple -- the chaos that follows is where it
gets truly complicated (and not necessarily worth getting into just
yet). The update of a given order and its order lines entries takes
place hours later, within a delivery transaction.

Another concern is knock on effects *after* any initial problematic
updates -- that's certainly not where it ends. Every action has
consequences, and these consequences themselves have consequences. By
migrating a row from an earlier block into a later block (both
physically and temporally early/late), we're not just changing things
for that original order row or rows (the order non-HOT-updated by the
delivery transaction). Since the old/original version left behind by a
delivery transaction will itself get vacuumed eventually, that space
goes in the FSM eventually. And so this same block will get some entirely
unrelated version (could be a new insert or update). It's systematic.

We dare not waste a tiny amount of free space. Surely we should cut
our losses at some point and accept that we're "wasting space", but
there is never any natural backpressure. Nothing to break the cycle of
fragmentation. (This is another thing involving time, not quite
related to my central point about free space leases.)

--
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Another concern is knock on effects *after* any initial problematic
> updates -- that's certainly not where it ends. Every action has
> consequences, and these consequences themselves have consequences. By
> migrating a row from an earlier block into a later block (both
> physically and temporally early/late), we're not just changing things
> for that original order row or rows (the order non-HOT-updated by the
> delivery transaction).

To be clear, TPC-C looks like this: each order row and each order line
row will be inserted just once, and then later updated just once. But
that's it, forever -- no more modifications. Both tables grow and grow
for as long as the benchmark runs. Users with workloads like this will
naturally expect that performance will steady over time. Even over
days or even weeks. But that's not what we see.

What we actually see is that the FSM can never quite resist the
temptation to dirty older pages that should be aging out. And so
performance degrades over days and weeks -- that's how long Jan has
said that it can take.

The FSM does have a bias in favor of using earlier pages over later
pages, in order to make relation truncation by VACUUM more
likely in general. That bias certainly isn't helping us here, and
might be another thing that hurts us. I think that the fundamental
problem is that the FSM just doesn't have any way of recognizing that
its behavior is penny wise, pound foolish. I don't believe that there
is any fundamental reason why Postgres can't have *stable* long term
performance for this workload in the way that it's really expected to.
That seems possible within the confines of the existing design for HOT
and VACUUM, which already work very well for the first few hours.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Andres Freund
Date:
Hi,

On 2021-08-16 10:35:45 -0700, Peter Geoghegan wrote:
> Problems
> --------
>
> The FSM gives out space without considering the passage of time, or
> the likelihood that a particular transaction or client will consume
> its requested free space in a more or less predictable and steady way
> over time. It has no memory. The FSM therefore fails to capture
> naturally occurring locality, or at least doesn't specifically care
> about it. This is the central problem, that all other problems with
> the FSM seem to stem from in one way or another.

I suspect that one other fairly fundamental issue is that we are very
reticent about updating the FSM with information about free space. As long as
that's the case a smarter placement logic would often not have a better
place to choose from.


> Background
> ----------
>
> To recap, the FSM tracks how much free space is in every heap page.
> Most FSM maintenance takes place during VACUUM, though ordinary
> connection backends occasionally help to keep the FSM current, via
> calls to RecordAndGetPageWithFreeSpace() made from hio.c.
>
> There is also a bulk extension mechanism added by commit 719c84c1be,
> which is used when we detect multiple concurrent inserters. This bulk
> extension mechanism does change things a little (it gives some thought
> to systematic effects), though it still has the same basic issues: it
> doesn't do nearly enough to group likely-related rows together. I
> suspect that improving the bulk extension mechanism will be an
> important part of improving the overall picture.

I think the design of the current bulk extension mechanism is quite backward
(it only takes contention into account, not other needs for bulk work, it does
expensive stuff like finding victim buffers under lock, it doesn't release
locks until all blocks are extended, ...).  But: I don't think a bulk
extension mechanism should be tasked with doing grouping or anything like
that.


> Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
> ever end up inserting *any* new logical rows on a heap page after the
> page first fills -- it is initially "open" when first allocated, and
> then quickly becomes "closed" to inserts of new logical rows, once it
> crosses a certain almost-empty space threshold. The heap page usually
> stays "closed" forever. While the design does not *completely* ignore
> the possibility that the page won't eventually get so empty that reuse
> really does look like a good idea, it makes it rare. A heap page needs
> to have rather a lot of the original logical rows deleted before that
> becomes possible again. It's rare for it to go backwards because the
> threshold that makes a closed page become open again is much lower
> (i.e. involves much more free space) than the threshold that initially
> made a (newly allocated) heap page closed. This one-way stickiness
> quality is important.

I suspect that we'd get a *LOT* of complaints if we introduced that degree of
stickiness.


> This stickiness concept is called "hysteresis" by some DB researchers,
> often when discussing UNDO stuff [8]. Having *far less* granularity
> than FSM_CATEGORIES/255 seems essential to make that work as intended.
> Pages need to be able to settle without being disturbed by noise-level
> differences. That's all that super fine grained values buy you: more
> noise, more confusion.

Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
from a fairly granular FSM_CATEGORIES, while still achieving better
locality. You could e.g. search for free space for an out-of-page update in
nearby pages with a lower free-percentage threshold, while having a different
threshold and selection criteria for placement of inserts.


> Visibility map
> --------------
>
> If the logical database and natural locality are important to the FSM,
> then what about the visibility map? And what about the relationship
> between the FSM and the visibility map, in light of all this?
>
> Currently VACUUM doesn't care about how its FSM behavior interacts
> with how it sets all-frozen/all-visible bits for the same heap page.
> To me this seems completely unreasonable -- they're *obviously*
> related! We're probably only gaining a minimal amount of free space on
> one occasion by ignoring the VM/FSM relationship, for which we pay a
> high cost. Worst of all we're *perpetuating the cycle* of dirtying and
> redirtying the same pages over time. Maybe we should go as far as
> merging the FSM and VM, even -- that seems like a natural consequence
> of logical-ish/qualitative definitions of "page is full".

The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
merging FSM and VM is a good direction to go to. Additionally we need to make
sure that the VM is accurately durable, which isn't the case for the FSM
(which I think we should use to maintain it more accurately).

But perhaps I'm just missing something here. What is the strong interlink
between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM
that the main linkage is on the side of inserts/update that choose a target
page from the FSM. Isn't it better to make that side smarter?



> Imagine if the FSM tracked recent free space requests that have been
> satisfied already. Requesting backends will maintain this same
> metadata when they burn through their bulk allocation of reserved
> space/heap pages. This new structure would contain metadata like the
> requesting XID/backend for a given lease of heap pages, the number of
> leases during the last bulk extension operation, and maybe the number
> of concurrently waiting backends at that time. Now we have some sense
> of recent history, and start to recognize trends over time. The FSM is
> now able to *take back* some of the space that it gives out, based on
> new information. Now the FSM can afford to make mistakes because the
> mistakes can always be corrected a little later. The FSM can be
> generous based on speculation, heuristics, whatever it might be. It
> should be possible for the FSM to have it both ways, more or less.

Hm. To me this seems like a somewhat separate layer from the FSM. Having that
kind of data go through shared buffers, with a need to worry about on-disk
consistency/cleanup etc, the inability to add granular locks on a sub-page
level, the need to go through buffer mapping / pinning all seem like things
you wouldn't actually want.

To me this seems like it'd be better addressed by a shared, per-relfilenode,
in-memory datastructure. Thomas Munro has been working on keeping accurate
per-relfilenode relation size information. ISTM that that'd be a better
place to hook in for this.


> would be quite natural for the leader backend to say to each individual
> follower backend: thank you for waiting for me, here is a big contiguous
> range of heap pages (e.g., as many as 50 heap pages or more per xact when
> things truly ramp up), which should be enough to allow you to avoid going to
> the FSM again for quite a long time.

That'd end up badly if we did that for a relation that's hotly inserted into
by a lot of connections, but where each connection only inserts a few rows. I
suspect we'd either need information from the inserting backends about the
number of pages they're likely going to need (e.g. a smarter "extension queue"
where each entry lists the number of pages requested) and/or stats about the
number of consecutive insertions a relation typically gets.


> But the leader backend doesn't do that at all -- it just puts all of
> the pages (up to 512 total) in the FSM. All of the waiting follower
> backends are left to senselessly fight it out for free space by simply
> going back to the FSM when they wake up again. Why can't they
> cooperate intelligently? What stops that today?

The lack of suitable in-memory coordination facilities.

It turns out that I looked a bunch at this area because of AIO + DIO: With DIO
the typical linux filesystems (ext4, xfs) do not perform any delayed
allocation, which means that the non-bulk extension leads to *horrible*
filesystem fragmentation. And even the bulk paths don't allocate enough blocks
at once to avoid fragmentation with common amounts of concurrency.  But just
increasing the degree of bulk extension isn't helpful - it just exacerbates
already bad thundering herd issues: All backends will decide to try to extend
at the same time, all but one of them will wait, the remaining backend will
extend and then wake up the others, and they all then contend for the same
locks, etc.

I addressed (or rather sidestepped) this to a small degree in the AIO+DIO
patchset, by making extension faster and doing a few improvements around
locking. But it's not good enough.

If the relation extension infrastructure instead had information about the
number of blocks each waiter wants and we had Thomas' "accurate shared
relation size" infrastructure, we could do a lot better:

1) Whenever a backend needs new space and most of the previously bulk-extended
   space is used opportunistically try to bulk-extend. In the happy path this
   avoids the thundering herd issue because other backends can continue
   inserting while extension is in progress.

2) Whenever bulk extending, release pages to individual backends by looking in
   the queue of backends needing extension and wake up backends individually
   whenever we've bulk extended sufficiently for that backend.

   This provides two main benefits: First, it avoids all backends trying to
   acquire exactly the same resources just after waking up, without being able
   to benefit (because the first backend will often just fill the
   page). Second, it improves average latency during extension substantially,
   because individual backends can start continuing as soon as bulk extension
   has progressed enough for some backend, rather than needing to wait for the
   entire extension.
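
To illustrate the bookkeeping that 2) implies, a smarter extension
queue entry might carry something like this (a hypothetical structure,
not taken from any existing patch):

    /*
     * Hypothetical entry in a smarter relation extension queue: each
     * waiter records how many blocks it wants, so the extending backend
     * can wake waiters individually as soon as their share is ready.
     */
    typedef struct ExtensionQueueEntry
    {
        PGPROC     *waiter;         /* backend waiting for new blocks */
        int         blocks_wanted;  /* 1 for a single-row INSERT, more for COPY */
        BlockNumber first_block;    /* filled in by the extending backend */
    } ExtensionQueueEntry;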



> FSM "rollback"
> --------------
>
> (This is where my argument comes together, finally.)
>
> FSM rollback in the FSIP paper is really just about reliable ownership
> semantics. Ownership of free space in heap pages that is scoped at the
> level of transactions.
>
> It seems to me that ownership of free space in heap pages by
> particular transactions is really all that the bulk extension FSM
> logic is missing -- and getting that right seems like a big part of
> fixing the FSM comprehensively. The bulk extension mechanism cannot
> *trust* the backends to use more than a little space at a time today.
> While on average each backend's use of free space probably *is*
> predictable over time, it only takes one or two straggler backends per
> bulk operation to mess everything up -- that's enough for the system
> to end up with very poor space utilization. Since we don't have any
> sense of ownership of space in heap pages by transactions today, the
> code must hedge against stragglers/space leaks by making the space
> *generally* available to all. Of course, this hedging comes at a high
> cost: it prevents the whole system (heapam, FSM, and backends) from
> capturing naturally occurring locality in a variety of important
> cases. Including the important TPC-C case I mentioned.

I think the fundamental issue with bulk extension is more that bulk
extension shouldn't actually get pages from the FSM at all. That's just an
ugly implementation detail of the hack we have currently, rather than
something that should live on.  We may want to decide to continue inserting
bulk extended pages into the FSM to handle edge cases of backends erroring out
before space is used, but the happy path should not touch the FSM at all.


> Explicit scoped ownership of free space in heap pages removes the need
> to hedge. It addresses the problem of would-be space leaks directly,
> while *also* fixing the FSM locality problems implicitly. Metadata
> about recent requests gives us the *context* needed to do all this
> well.

It sounds like you're thinking of using some global history, or at least a
backend local history? I think it might be better to have backend local,
per-command state. E.g. for a COPY aggressively increase the amount of space
reserved, but not for a single-row insert.



> I now wonder if the FSM is fundamentally doing the wrong thing by
> keeping track of all "free space" in every page over time. Why
> wouldn't we just have a free space management strategy that is
> generally only concerned with recent events?

I guess it depends on what you define as recent, but I don't see a way to
bound the timeframe usefully. We need to deal with cases where table sizes go
up and down over time, e.g. because there's periodic batch activity and
ongoing activity. What's the time bound for unused space somewhere in a
relation? I don't see any, unless we add compaction logic to vacuum.

Greetings,

Andres Freund



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Tue, Aug 17, 2021 at 9:18 AM Andres Freund <andres@anarazel.de> wrote:
> > Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
> > ever end up inserting *any* new logical rows on a heap page after the
> > page first fills -- it is initially "open" when first allocated, and
> > then quickly becomes "closed" to inserts of new logical rows, once it
> > crosses a certain almost-empty space threshold. The heap page usually
> > stays "closed" forever. While the design does not *completely* ignore
> > the possibility that the page won't eventually get so empty that reuse
> > really does look like a good idea, it makes it rare. A heap page needs
> > to have rather a lot of the original logical rows deleted before that
> > becomes possible again. It's rare for it to go backwards because the
> > threshold that makes a closed page become open again is much lower
> > (i.e. involves much more free space) than the threshold that initially
> > made a (newly allocated) heap page closed. This one-way stickiness
> > quality is important.
>
> I suspect that we'd get a *LOT* of complaints if we introduced that degree of
> stickiness.

I don't know what the right degree of stickiness is, but I can easily
believe that we want to have more than none. Part of Peter's point, at
least as I understand it, is that if a page has 100 tuples and you
delete 1 or 2 or 3 of them, it is not smart to fill up the space thus
created with 1 or 2 or 3 new tuples. You should instead hope that
space will optimize future HOT updates, or else wait for the page to
lose some larger number of tuples and then fill up all the space at
the same time with tuples that are, hopefully, related to each other
in some useful way. Now what's the threshold? 20 out of 100? 50? 80?
That's not really clear to me. But it's probably not 1 out of 100.

> > This stickiness concept is called "hysteresis" by some DB researchers,
> > often when discussing UNDO stuff [8]. Having *far less* granularity
> > than FSM_CATEGORIES/255 seems essential to make that work as intended.
> > Pages need to be able to settle without being disturbed by noise-level
> > differences. That's all that super fine grained values buy you: more
> > noise, more confusion.
>
> Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
> from a fairly granular FSM_CATEGORIES, while still achieving better
> locality. You could e.g. search for free space for an out-of-page update in
> nearby pages with a lower free-percentage threshold, while having a different
> threshold and selection criteria for placement of inserts.

Yeah, I don't think that reducing FSM_CATEGORIES is likely to have
much value for its own sake. But it might be useful as a way of
accomplishing some other goal. For example if we decided that we need
a bit per page to track "open" vs. "closed" status or something of
that sort, I don't think we'd lose much by having fewer categories.

> The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
> merging FSM and VM is a good direction to go to. Additionally we need to make
> sure that the VM is accurately durable, which isn't the case for the FSM
> (which I think we should use to maintain it more accurately).
>
> But perhaps I'm just missing something here. What is the strong interlink
> between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM
> that the main linkage is on the side of inserts/update that choose a target
> page from the FSM. Isn't it better to make that side smarter?

I don't have a well-formed opinion on this point yet. I am also
skeptical of the idea of merging the FSM and VM, because it seems to
me that which pages are all-visible and which pages are full are two
different things. However, there's something to Peter's point too. An
empty page is all-visible but still valid as an insertion target, but
this is not as much of a contradiction to the idea of merging them as
it might seem, because marking the empty page as all-visible is not
nearly as useful as marking a full page all-visible. An index-only
scan can't care about the all-visible status of a page that doesn't
contain any tuples, and we're also likely to make the page non-empty
in the near future, in which case the work we did to set the
all-visible bit and log the change was wasted. The only thing we're
accomplishing by making the page all-visible is letting VACUUM skip
it, which will work out to a win if it stays empty for multiple VACUUM
cycles, but not otherwise.

Conceptually, you might think of the merged data structure as a
"quiescence map." Pages that aren't being actively updated are
probably all-visible and are probably not great insertion targets.
Those that are being actively updated are probably not all-visible and
have better chances of being good insertion targets. However, there
are a lot of gremlins buried in the word "probably." A page could get
full but still not be all-visible for a long time, due to long-running
transactions, and it doesn't seem likely to me that we can just ignore
that possibility. Nor on the other hand does the difference between an
old-but-mostly-full page and an old-but-mostly-empty page seem like
something we just want to ignore. So I don't know how a merged data
structure would actually end up being an improvement, exactly.

> To me this seems like it'd be better addressed by a shared, per-relfilenode,
> in-memory datastructure. Thomas Munro has been working on keeping accurate
> per-relfilenode relation size information. ISTM that that'd be a better
> place to hook in for this.

+1. I had this same thought reading Peter's email. I'm not sure if it
makes sense to hook that into Thomas's work, but I think it makes a
ton of sense to use shared memory to coordinate transient state like
"hey, right now I'm inserting into this block, you guys leave it
alone" while using the disk for durable state like "here's how much
space this block has left."

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Bruce Momjian
Date:
On Mon, Aug 16, 2021 at 05:15:36PM -0700, Peter Geoghegan wrote:
> It doesn't make sense to have a local cache for a shared resource --
> that's the problem. You actually need some kind of basic locking or
> lease system, so that 10 backends don't all decide at the same time
> that one particular heap block is fully empty, and therefore a good
> target block for that backend alone. It's as if the backends believe
> that they're special little snowflakes, and that no other backend
> could possibly be thinking the same thing at the same time about the
> same heap page. And so when TPC-C does its initial bulk insert,
> distinct orders are already shuffled together in a hodge-podge, just
> because concurrent bulk inserters all insert on the same heap pages.

OK, I am trying to think of something simple we could test to see the
benefit, with few downsides.  I assume the case you are considering is
that you have a table with 10 8k pages, where one page is 80% full and the
others are 81% full, and if several backends start adding rows at the
same time, they will all choose the 80%-full page.

What if we change how we select pages with this:

1.  find the page with the most free space
2.  find all pages with up to 10% less free space than page #1
3.  count the number of pages in #2
4.  compute the proc_id modulus step #3 and use that page's offset from
    step #2

For example:

1.  page with most freespace is 95% free
2.  pages 2,4,6,8,10 have between 86%-95% free
3.  five pages
4.  proc id 14293 % 5 = 3 so use the third page from #2, page 6

This should spread out page usage to be more even, but still favor pages
with more freespace.  Yes, this is simplistic, but it would seem to have
few downsides and I would be interested to see how much it helps.
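
In code, that selection rule might look like the sketch below (a
hypothetical helper, not an actual patch; the 10% window and the use of
the backend's pid are taken directly from the steps above):

    /*
     * Sketch of the proposal: among candidate pages whose free space is
     * within 10% of the emptiest page, pick one based on this backend's
     * pid.  "candidates" is assumed to be sorted by free space,
     * descending, with npages >= 1.
     */
    static BlockNumber
    pick_spread_page(BlockNumber *candidates, Size *free_bytes, int npages)
    {
        Size    best = free_bytes[0];
        int     nclose = 0;

        /* Steps 2/3: count pages with up to 10% less free space than #1 */
        while (nclose < npages && free_bytes[nclose] * 10 >= best * 9)
            nclose++;

        /* Step 4: pid modulo the number of close-enough pages */
        return candidates[MyProcPid % nclose];
    }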

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: The Free Space Map: Problems and Opportunities

From
John Naylor
Date:
On Mon, Aug 16, 2021 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Open and closed pages
> ---------------------

> This stickiness concept is called "hysteresis" by some DB researchers,
> often when discussing UNDO stuff [8]. Having *far less* granularity
> than FSM_CATEGORIES/255 seems essential to make that work as intended.
> Pages need to be able to settle without being disturbed by noise-level
> differences. That's all that super fine grained values buy you: more
> noise, more confusion.

I'm not sure it's essential to have "far" fewer categories, if the
closed-to-open transition is made less granular through some other
mechanism. We can certainly get by with fewer categories, freeing up
bits -- it seems we'd need at least one bit to track a block's
open-close state.

> Visibility map
> --------------
>
> If the logical database and natural locality are important to the FSM,
> then what about the visibility map? And what about the relationship
> between the FSM and the visibility map, in light of all this?
>
> Currently VACUUM doesn't care about how its FSM behavior interacts
> with how it sets all-frozen/all-visible bits for the same heap page.
> To me this seems completely unreasonable -- they're *obviously*
> related! We're probably only gaining a minimal amount of free space on
> one occasion by ignoring the VM/FSM relationship, for which we pay a
> high cost. Worst of all we're *perpetuating the cycle* of dirtying and
> redirtying the same pages over time. Maybe we should go as far as
> merging the FSM and VM, even -- that seems like a natural consequence
> of logical-ish/qualitative definitions of "page is full".

> [...]

> I now wonder if the FSM is fundamentally doing the wrong thing by
> keeping track of all "free space" in every page over time. Why
> wouldn't we just have a free space management strategy that is
> generally only concerned with recent events? If we find a way to make
> almost all newly allocated heap pages become "closed" quickly (maybe
> they're marked all-frozen quickly too), and make sure that that
> condition is sticky, then this can work out well. We may not even need
> to store explicit freespace information for most heap pages in the
> database -- a page being "closed" can be made implicit by the FSM
> and/or VM. Making heap pages age-out like this (and mostly stay that
> way over time) has obvious benefits in many different areas.

The second paragraph here is an interesting idea and makes a great deal
of sense. It would lead to smaller FSMs that are navigated more quickly
and locked for shorter durations.

Implicit "closure" seems riskier in my view if you want to bring VM qualities into it, however. Currently, setting an all-visible or all-frozen flag must be correct and crash-safe, but clearing those is just a lost optimization. If either of those qualities are implicit by lack of reference, it seems more vulnerable to bugs.

On Tue, Aug 17, 2021 at 12:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Aug 17, 2021 at 9:18 AM Andres Freund <andres@anarazel.de> wrote:
>
> > To me this seems like it'd be better addressed by a shared, per-relfilenode,
> > in-memory datastructure. Thomas Munro has been working on keeping accurate
> > per-relfilenode relation size information. ISTM that that'd be a better
> > place to hook in for this.
>
> +1. I had this same thought reading Peter's email. I'm not sure if it
> makes sense to hook that into Thomas's work, but I think it makes a
> ton of sense to use shared memory to coordinate transient state like
> "hey, right now I'm inserting into this block, you guys leave it
> alone" while using the disk for durable state like "here's how much
> space this block has left."

This makes sense as well. Shared memory for more recent / highly
contended state, and disk for less recent / less contended / stickier
state. This also might have the advantage of smaller, more focused
projects from a coding standpoint.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Tue, Aug 17, 2021 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
> I suspect that one other fairly fundamental issue is that we are very
> reticent about updating the FSM with information about free space. As long as
> that's the case a smarter placement logic would often not have a better
> place to choose from.

I think so too. But that is made a lot harder by the supposed need to
represent the amount of freespace in FSM_CATEGORIES (255) distinct
increments. As opposed to maybe 4 distinct increments per page, like
in other systems. You only have to update the externally stored
metadata when a threshold is crossed.

(This isn't the main reason why I don't like that; more on that later.)
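
The "only update on a threshold crossing" point amounts to something
like this sketch (reusing the hypothetical coarse_space_cat() from the
sketch earlier in the thread; where old_avail comes from is of course
part of the problem):

    /*
     * Sketch: with only a handful of coarse categories, most free space
     * changes never cross a category boundary, and so never need to
     * touch the externally stored FSM at all.
     */
    static void
    maybe_record_free_space(Relation rel, BlockNumber blk,
                            Size old_avail, Size new_avail)
    {
        if (coarse_space_cat(old_avail, BLCKSZ) !=
            coarse_space_cat(new_avail, BLCKSZ))
            RecordPageWithFreeSpace(rel, blk, new_avail);
    }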

> I think the design of the current bulk extension mechanism is quite backward
> (it only takes contention into account, not other needs for bulk work, it does
> expensive stuff like finding victim buffers under lock, it doesn't release
> locks until all blocks are extended, ...).  But: I don't think a bulk
> extension mechanism should be tasked with doing grouping or anything like
> that.

Maybe I'm just drawing the boundaries differently in my head, which
doesn't seem like a real difference of opinion.

> > Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
> > ever end up inserting *any* new logical rows on a heap page after the
> > page first fills -- it is initially "open" when first allocated, and
> > then quickly becomes "closed" to inserts of new logical rows, once it
> > crosses a certain almost-empty space threshold.

> I suspect that we'd get a *LOT* of complaints if we introduced that degree of
> stickiness.

Maybe. It would be very different to our current behavior, which would
undoubtedly have many consequences. Maybe this will be a big problem,
maybe it barely matters. I make no specific claims about it, either
way. Not just yet.

> Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
> from a fairly granular FSM_CATEGORIES, while still achieving better
> locality. You could e.g. search for free space for an out-of-page update in
> nearby pages with a lower free-percentage threshold, while having a different
> threshold and selection criteria for placement of inserts.

Again, it's all about *systemic* effects. A FSM request is basically a
choice *among* blocks that could in principle satisfy the request --
often the number of basically-satisfactory blocks is huge (way too
large, even). You have to bear in mind that concurrent requests are in
competition with each other much of the time. They are usually all
looking for exactly the same thing (e.g., same free space
requirement), or close to the same thing -- tuples within a given
table tend to all be about as wide as each other.

I can clearly sometimes see very high contention cases, where backends
do way too much spinning inside the RelationGetBufferForTuple() loop.
They're all more or less chasing the same scraps of free space, in a
highly inefficient way. Even though there is probably an abundance of
free space. Right now the heap pages that have the most free space are
also the ones that are least likely to be used.

Though I think that these backends should cooperate more, some amount
of competition is probably necessary. If FSM_CATEGORIES used 3 bits
instead of 8, then the backends could fall back on caring about some
other thing that distinguished heap pages from each other that
actually matters. Leading to less contention, and maybe better space
utilization.

I have presented this FSM_CATEGORIES contention issue as a problem
distinct from the first one (the locality problem), though honestly
that division was a bit arbitrary -- the line between them is blurry.
Thank you for putting up with that.

> The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
> merging FSM and VM is a good direction to go to. Additionally we need to make
> sure that the VM is accurately durable, which isn't the case for the FSM
> (which I think we should use to maintain it more accurately).

I explained this part badly. I don't think that we should literally
unite the VM and FSM data structures. I was really just pointing out
that a logical-ish notion of a page being full (to use the term of art
I've seen, a "closed" page) is almost the same thing as a page that is
marked all-visible/all-frozen. Presumably VACUUM does that with the
optimistic assumption that it will probably never again need to touch
the same page. And while that can never be guaranteed, it certainly
seems like we might want to weigh things in favor of that happening.

I'm pretty sure that carelessly remembering 200 bytes of free space in
the FSM for an all-frozen page is practically guaranteed to be a bad
idea. While I don't have a great idea of where the break-even point is
just yet, I am prepared to say that I'm sure that it's not 0 bytes,
which is how it works today. A squishy logical definition of "page is
full" seems pretty natural to me.

> But perhaps I'm just missing something here. What is the strong interlink
> between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM
> that the main linkage is on the side of inserts/update that choose a target
> page from the FSM. Isn't it better to make that side smarter?

I'll respond to Robert about this separately.

> To me this seems like it'd be better addressed by a shared, per-relfilenode,
> in-memory datastructure. Thomas Munro has been working on keeping accurate
> per-relfilenode relation size information. ISTM that that'd be a better
> place to hook in for this.

Actually, that's the approach that I'm taking in my current
prototyping. I would like to have a WAL-logged component to this as
well, but I haven't got that far.

I think that it's possible that I'll eventually conclude that the
whole idea of a FSM is just the wrong idea. Remembering anything at
all about most pages may just be unnecessary in a world where we fix
everything. Maybe the end result is that the common case is that most
individual heap pages are "closed". Perhaps that can just be implicit
-- there is no information to remember.

> > would be quite natural for the leader backend to say to each individual
> > follower backend: thank you for waiting for me, here is a big contiguous
> > range of heap pages (e.g., as many as 50 heap pages or more per xact when
> > things truly ramp up), which should be enough to allow you to avoid going to
> > the FSM again for quite a long time.
>
> That'd end up badly if we did for a relation that's hotly inserted into by a
> lot of connections, but where each connection only inserts a few rows. I
> suspect we'd either need information from the inserting backends about the
> number of pages they're likely going to need (e.g. a smarter "extension queue"
> where each entry lists the number of pages requested) and/or stats about the
> number of consecutive insertions a relation typically gets.

That's a part of it, certainly. The heuristics would mostly be driven
by recent information. In particular, we would probably only ramp
things up based on a clear observed failure to keep up with demand. We
could be very aggressive indeed this way, without too much extra risk.
We'd only ramp up because we *knew* that the most recent batch/set of
free lists for the last bulk request was inadequate.

For that you need the context, the recent history.
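
Just to illustrate the shape of the feedback loop I have in mind (every
name and constant here is invented; 512 only mirrors the current
bulk-extend cap):

"""
#include <stdint.h>

/* Invented per-relation state; purely illustrative. */
typedef struct ExtendHistory
{
    uint32_t    last_batch_size;    /* pages handed out last time */
    uint32_t    last_batch_used;    /* pages actually consumed */
    int64_t     last_batch_millis;  /* how long the batch lasted */
} ExtendHistory;

/*
 * Decide the next bulk-extension batch size.  We only ramp up when the
 * previous batch was exhausted quickly (a clear observed failure to
 * keep up), and we back off when most of it went unused.
 */
static uint32_t
next_batch_size(const ExtendHistory *h)
{
    uint32_t    next = h->last_batch_size ? h->last_batch_size : 8;

    if (h->last_batch_used >= h->last_batch_size &&
        h->last_batch_millis < 1000)
        next *= 2;              /* demand outpaced supply: be aggressive */
    else if (h->last_batch_used * 2 < h->last_batch_size)
        next /= 2;              /* most of the last batch went unused */

    if (next < 1)
        next = 1;
    if (next > 512)
        next = 512;             /* mirror the current bulk-extend cap */

    return next;
}
"""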

> > But the leader backend doesn't do that at all -- it just puts all of
> > the pages (up to 512 total) in the FSM. All of the waiting follower
> > backends are left to senselessly fight it out for free space by simply
> > going back to the FSM when they wake up again. Why can't they
> > cooperate intelligently? What stops that today?
>
> The lack of suitable in-memory coordination facilities.
>
> It turns out that I looked a bunch at this area because of AIO + DIO: With DIO
> the typical linux filesystems (ext4, xfs) do not perform any delayed
> allocation, which means that the non-bulk extension leads to *horrible*
> filesystem fragmentation.

It's the coordination facilities -- agreed. The only way that I seem
to be going further than what you've said here is the logical database
stuff. That is, I intend to specifically involve metadata like XIDs,
as well as a general understanding of things like commit/abort
boundaries for said XIDs. Ownership of a free space lease shouldn't be
strictly tied to an XID, but that seems like a good starting point to
me. This is the kind of thing that an OS engineer would never think
of, that other database systems seem to do a lot of.

This design involving waiting on XIDs could also be reused for nbtree
page deletion, where page recycling is deferred (same with GiST page
deletion). We could model this problem as: the
physically-deleted-by-VACUUM page's safexid value should be treated as
"the maybe-still-ongoing transaction that owns this page". This is
kind of a lie, but close enough to the truth. That way we can put
every deleted page in the FSM/whatever immediately, while reusing the
same XID-wait stuff to solve the recycle safety problem.

We push the recycle safety stuff on to the consumer side this way.
This is currently a problem that mostly belongs to VACUUM itself (the
producer side) -- which makes zero sense. I think that the only reason
we do it that way is because the current FSM doesn't care at all about
XID boundaries. It's almost the same problem, maybe even exactly the
same.
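
Roughly, the consumer-side check might end up looking like this (a
sketch only -- is_xid_visibly_complete() is a stand-in for whatever "no
snapshot can still care about this XID" test we'd really use):

"""
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t FullTransactionId;

/*
 * Hypothetical entry describing one reusable page.  For a heap page
 * this is the inserting transaction that "owns" the free space lease;
 * for a deleted nbtree/GiST page we pretend that the page's safexid is
 * a maybe-still-ongoing owner, which gives both cases the same shape.
 */
typedef struct FreePageEntry
{
    uint32_t            blkno;
    FullTransactionId   owner_xid;  /* lease owner, or deleted page's safexid */
} FreePageEntry;

/* Stand-in for a real "no snapshot can still see this XID" test. */
extern bool is_xid_visibly_complete(FullTransactionId fxid);

/*
 * Consumer-side recycle safety: a page is only handed out once its
 * owning XID is safely in the past.  VACUUM (the producer) can register
 * deleted pages immediately and then forget about them.
 */
static bool
page_is_reusable(const FreePageEntry *entry)
{
    return is_xid_visibly_complete(entry->owner_xid);
}
"""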

> I addressed (or rather sidestepped) this to a small degree in the AIO+DIO
> patchset, by making extension faster and doing a few improvements around
> locking. But it's not good enough.
>
> If the relation extension infrastructure instead had information about the
> number of blocks each waiter wants and we had Thomas' "accurate shared
> relation size" infrastructure, we could do a lot better:

Sounds great to me!

> 1) Whenever a backend needs new space and most of the previously bulk-extended
>    space is used opportunistically try to bulk-extend. In the happy path this
>    avoids the thundering herd issue because other backends can continue
>    inserting while extension is in progress.

Right -- my thoughts exactly.

> 2) Whenever bulk extending, release pages to individual backends by looking in
>    the queue of backends needing extension and wake up backends individually
>    whenever we've bulk extended sufficiently for that backend.

This is also exactly what I was thinking of, albeit for slightly
different reasons.

I want to more or less say directly to each backend at the point it
wakes up: here you go, here are those pages you asked for. Somebody
might check in about this free space lease later on, but in general
you can think of these pages as your exclusive property.
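
For concreteness, here is a sketch of that hand-off (the queue layout
and all of the names are invented): each waiter states how many pages
it wants, and the extending backend peels a contiguous range off the
newly extended space for each waiter before waking it.

"""
#include <stdint.h>

/* Invented queue entry: one waiting backend and its stated demand. */
typedef struct ExtendWaiter
{
    int         proc_no;        /* which backend to wake */
    uint32_t    npages_wanted;  /* how many pages it asked for */
    uint32_t    first_block;    /* filled in by the extending backend */
    uint32_t    nblocks;        /* size of the range it was granted */
} ExtendWaiter;

/*
 * After bulk-extending the relation by 'total' blocks starting at
 * 'start', carve the new space into per-backend ranges.  Each waiter
 * gets a contiguous run it can treat as its own property; nothing goes
 * through the FSM in the happy path.
 */
static void
distribute_new_blocks(ExtendWaiter *waiters, int nwaiters,
                      uint32_t start, uint32_t total)
{
    uint32_t    next = start;

    for (int i = 0; i < nwaiters && total > 0; i++)
    {
        uint32_t    grant = waiters[i].npages_wanted;

        if (grant > total)
            grant = total;

        waiters[i].first_block = next;
        waiters[i].nblocks = grant;
        next += grant;
        total -= grant;

        /* wake_backend(waiters[i].proc_no); -- hypothetical */
    }
}
"""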

Making it backend/XID specific is probably helpful in many ways. Like
maybe there is a natural variation in demand among backends, even
though they're all bulk inserting into the same table. Who knows?

> I think the fundamental issue of bulk extension issue is more that bulk
> extension shouldn't actually get pages from the FSM at all. That's just an
> ugly implementation detail of the hack we have currently, rather than
> something that should live on.

Actually, I agree. I was just trying to use our existing terms to
introduce my ideas.

> We may want to decide to continue inserting
> bulk extended pages into the FSM to handle edge cases of backends erroring out
> before space is used, but the happy path should not touch the FSM at all.

You don't have to touch the FSM if you don't even have one in the
first place.   :-)

> It sounds like you're thinking of using some global history, or at least a
> backend local history? I think it might be better to have backend local,
> per-command state. E.g. for a COPY aggressively increase the amount of space
> reserved, but not for a single-row insert.

That seems quite possible. Not sure just yet.

This feels like the right general idea to me, even though many of the
details are not worked out.

> I guess it depends on what you define as recent, but I don't see a way to
> bound the timeframe usefully. We need to deal with cases where table sizes go
> up and down over time, e.g. because there's periodic batch activity and
> ongoing activity. What's the time bound for unused space somewhere in a
> relation? I don't see any, unless we add compaction logic to vacuum.

I don't want to bound the timeframe, exactly. I want to look at what
happened during the last bulk operation, and compare that to what I
see for the current/ongoing operation -- which could involve a fair
amount of details. The idea is to make decisions based on inferences
about what is *changing*. What's working, what's not working -- the
*direction* of things is at least as important as the state of things
at this exact point in time.

This does mean that you have to accept that some number of heap pages
can be "leaked" if inserts against a table that was being
bulk-extended suddenly go away completely. Of course it should be
possible to get this so-called leaked space back, just as soon as the
table receives more inserts (any inserts should do that). My current
best guess is that this will work fine in practice.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Tue, Aug 17, 2021 at 9:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know what the right degree of stickiness is, but I can easily
> believe that we want to have more than none. Part of Peter's point, at
> least as I understand it, is that if a page has 100 tuples and you
> delete 1 or 2 or 3 of them, it is not smart to fill up the space thus
> created with 1 or 2 or 3 new tuples. You should instead hope that
> space will optimize future HOT updates, or else wait for the page to
> lose some larger number of tuples and then fill up all the space at
> the same time with tuples that are, hopefully, related to each other
> in some useful way.

That is a big part of it. But it gets worse than that. It is truly
insidious, once you follow the chain of problems over time -- maybe
over hours and hours. BenchmarkSQL makes this a lot easier.

Right now when we set heap fill factor to 85 (which is what
BenchmarkSQL does), we effectively have this chunk of special reserved
free space left behind on each page. The remaining 15% of tuple space
is of course reserved for successor versions of the same logical rows
-- those rows from the time that the page first fills (presumably they
all come from straight inserts). That's our starting point -- simple
enough.

The approach we take effectively gives up on the *whole entire page*
at literally the first sign of trouble -- the first time we fail to
hold a successor version on that same page. Even though it may just as
well have been a once-off perturbation. What we might call bad luck, as
opposed to an inevitability. For example, maybe ANALYZE ran that
second, which made the crucial difference for that one heap page that
one time. That's kind of enough to make the implementation assume all
bets are off, for the entire page -- it's as if we pessimistically
assume that there is no chance of keeping anything like the original
rows on the heap page, so we might as well let it all get mixed up
even more. This is way too pessimistic -- it's black and white
thinking.

While in general that level of pessimism might turn out to be
justified (anything is possible), we should hold the line and find out
for sure how bad it is. I happen to know that our problematic
BenchmarkSQL/TPC-C tables just don't have any skew, so this pessimism
is particularly wrong-headed there.

Bear in mind that the new unrelated tuples that ultimately replace our
original couldn't-fit tuples (probably after the next VACUUM) are
probably not "average tuples". Maybe they're newly inserted tuples,
which is bad in the BenchmarkSQL case, because they're all
*guaranteed* to be updated in the future (owing to the details of the
workload). With some other typical workload they might often be
successor versions from updates instead -- most likely for rows that
are disproportionately likely to be updated many times. It's truly
pernicious.

If we only accepted the original row migration, and didn't throw good
money after bad, then the cycle would be broken. To me this is a bit
like certain heuristic algorithms, where you technically accept a
suboptimal outcome to avoid getting stuck in a local maximum.

> Now what's the threshold? 20 out of 100? 50? 80?

I'm not going to pretend to know the answer. But I will point out that
one DB system whose heap fill factor defaults to 90 seems to have a
symmetric setting for the "open up page again" point -- the default
for that is 40. Not sure that that really matters to us, but that does
seem pretty low to me. It's very sticky indeed.

> Yeah, I don't think that reducing FSM_CATEGORIES is likely to have
> much value for its own sake. But it might be useful as a way of
> accomplishing some other goal. For example if we decided that we need
> a bit per page to track "open" vs. "closed" status or something of
> that sort, I don't think we'd lose much by having fewer categories.

This is probably one of my less important points. I addressed this
point in my remarks on this to Andres from earlier.

> I don't have a well-formed opinion on this point yet. I am also
> skeptical of the idea of merging the FSM and VM, because it seems to
> me that which pages are all-visible and which pages are full are two
> different things.

As I clarified in my email to Andres, this was just a bad choice of
words. I meant that they're at least conceptually much closer,
provided you're willing to accept my general view of things.

> However, there's something to Peter's point too. An
> empty page is all-visible but still valid as an insertion target, but
> this is not as much of a contradiction to the idea of merging them as
> it might seem, because marking the empty page as all-visible is not
> nearly as useful as marking a full page all-visible. An index-only
> scan can't care about the all-visible status of a page that doesn't
> contain any tuples, and we're also likely to make the page non-empty
> in the near future, in which case the work we did to set the
> all-visible bit and log the change was wasted. The only thing we're
> accomplishing by making the page all-visible is letting VACUUM skip
> it, which will work out to a win if it stays empty for multiple VACUUM
> cycles, but not otherwise.

That's a big part of it. The other thing is systemic effects, which is
related to the part of this email above about chain of events. I'll
talk about that some more now.

Let's say that VACUUM puts a non-zero usable amount of free space in
the FSM for a page that it marks all-visible/all-frozen at the same
time -- this is possible today, of course. To me this seems
contradictory, at least when there isn't much space -- which I believe
is common. It's at least a minor waste to set the VM bit.

Now let's say we change VACUUM to not waste effort on setting the heap
page's VM bit -- we're now avoiding that minor waste, which is an okay
goal. But
we're also doing something else, that I find very objectionable: we're
effectively betting that the situation will improve for the same page
at some point in the future, during a future VACUUM. It is possible
that we could get just one or two inserts on the same page, and then
see it again, and then we're done. If that happens, we win our bet.
Otherwise, we lose our bet.

Here's the problem with making that bet: it's typically an awful bet.
The potential upside is clearly pretty low. And more importantly,
there is a decent chance that we'll actually get some row that is
going to mess everything up -- not an "average row" (whatever that may
mean). Again, very frequently updated rows are often what'll end up on
the page, which are enough to make the page never have its all-visible
bit set.

Here is why I find this *extremely* objectionable: we *actually* risk
experiencing a downside for the heap page by taking this awful bet
*many many times* in the future -- it's not just a once-off downside,
I believe. This is not in fact a choice about one heap page on one
occasion -- it is actually a *policy* affecting the page again and
again (and every other heap page too). It is only by making a
not-quite-physically-full page "closed" in the way that I advocate we
avoid taking this awful bet -- that is how it's possible to artificially *make*
this a question about one page on one occasion. We're bounding the
downside, lessening our exposure to the worst case. Intuitively, I
think that the possible downside is essentially unlimited, if we
assume systemic destabilization risks (which seems wise). Whereas the
possible upside of taking that awful bet is clearly, measurably small.
A once-off upside. Peanuts, really.

(Of course my argument won't work if you assume that even with the
"closed" page choice we naturally get more updates to one of the aged
frozen rows anyway. But we're counting on that not happening in either
scenario/with *either* strategy, so I don't think that it invalidates
anything I've said.)

> Conceptually, you might think of the merged data structure as a
> "quiescence map." Pages that aren't being actively updated are
> probably all-visible and are probably not great insertion targets.
> Those that are being actively updated are probably not all-visible and
> have better chances of being good insertion targets. However, there
> are a lot of gremlins buried in the word "probably." A page could get
> full but still not be all-visible for a long time, due to long-running
> transactions, and it doesn't seem likely to me that we can just ignore
> that possibility.

I think that we should opportunistically be setting the all-frozen bit
in heap pages (and freezing tuples earlier than required) anyway. For
example, we could notice that a whole heap page is eligible to be
marked all-frozen, and freeze the tuples early -- we could restart
processing of the page once we realized that freezing early is a good
idea.

Why even have a distinct all-visible VM bit? The per-tuple overhead of
the WAL records for freezing is kind of bulky, but that could be
fixed. It's not a good reason. (Of course I'm not arguing that we
should break pg_upgrade, just that there is no reason not to freeze
early when it's cheap to do so in passing.)

> +1. I had this same thought reading Peter's email. I'm not sure if it
> makes sense to hook that into Thomas's work, but I think it makes a
> ton of sense to use shared memory to coordinate transient state like
> "hey, right now I'm inserting into this block, you guys leave it
> alone" while using the disk for durable state like "here's how much
> space this block has left."

I agree that that's the best starting point for this.

--
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Tue, Aug 17, 2021 at 11:24 AM Bruce Momjian <bruce@momjian.us> wrote:
> OK, I am trying to think of something simple we could test to see the
> benefit, with few downsides.  I assume the case you are considering is
> that you have a 10 8k-page table, and one page is 80% full and the
> others are 81% full, and if several backends start adding rows at the
> same time, they will all choose the 80%-full page.

That's one problem (there are so many). They start with that same page,
and then fight each other over other pages again and again. This
emergent behavior is a consequence of the heuristic that has the FSM
look for the heap page with the least space that still satisfies a
given request. I mentioned this problem earlier.

> For example:
>
> 1.  page with most freespace is 95% free
> 2.  pages 2,4,6,8,10 have between 86%-95% free
> 3.  five pages
> 4.  proc id 14293 % 5 = 3 so use the third page from #2, page 6

Right now my main concern is giving backends/XIDs their own space, in
bulk. Mostly for concurrent bulk inserts, just so we can say that
we're getting the easy cases right -- that is a good starting point
for my prototype to target, I think. But I am also thinking about
stuff like this, as one way to address contention.

> This should spread out page usage to be more even, but still favor pages
> with more freespace.  Yes, this is simplistic, but it would seem to have
> few downsides and I would be interested to see how much it helps.

I thought about having whole free lists that were owned by XIDs, as
well as shared free lists that are determined by hashing MyProcPid --
or something like that.  So I agree that it might very well make
sense.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Tue, Aug 17, 2021 at 12:11 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> I'm not sure it's essential to have "far" fewer categories, if the
> closed-to-open transition is made less granular through some other
> mechanism.

See my remarks to Andres about FSM_CATEGORIES from earlier. (It may
not matter if you don't believe that it helps us to have such high
granularity -- that means it won't hurt to get rid of it, which might
be a good enough reason on its own.)

> > I now wonder if the FSM is fundamentally doing the wrong thing by
> > keeping track of all "free space" in every page over time. Why
> > wouldn't we just have a free space management strategy that is
> > generally only concerned with recent events?

> The second paragraph here is an interesting idea and makes a great
> deal of sense. It would lead to smaller FSMs that are navigated more
> quickly and locked for shorter durations.

Most importantly, the FSM/whatever doesn't have to be locked at all if
you don't go to it. And so going to the FSM much less often with space
leases or whatever seems like it might be a big win for that reason
(not just the locality stuff that I care about).

> Implicit "closure" seems riskier in my view if you want to bring VM qualities into it, however.

I didn't mean to imply that we'd be changing that data structure. As I
clarified earlier, "merging the FSM and the VM" was meant in a
conceptual way. Not the best choice of words on my part.

> This makes sense as well. Shared memory for more recent / highly
> contended state, and disk for less recent / less contended / stickier
> state.

I would like to also make it WAL-logged, which is easier this way.
That isn't my main goal, but why not do it? The way that the FSM works
right now within VACUUM REDO routines is a problem.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Tue, Aug 17, 2021 at 6:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Again, it's all about *systemic* effects. A FSM request is basically a
> choice *among* blocks that could in principle satisfy the request --
> often the number of basically-satisfactory blocks is huge (way too
> large, even). You have to bear in mind that concurrent requests are in
> competition with each other much of the time. They are usually all
> looking for exactly the same thing (e.g., same free space
> requirement), or close to the same thing -- tuples within a given
> table tend to all be about as wide as each other.
>
> I can clearly sometimes see very high contention cases, where backends
> do way too much spinning inside the RelationGetBufferForTuple() loop.
> They're all more or less chasing the same scraps of free space, in a
> highly inefficient way. Even though there is probably an abundance of
> free space. Right now the heap pages that have the most free space are
> also the ones that are least likely to be used.
>
> Though I think that these backends should cooperate more, some amount
> of competition is probably necessary. If FSM_CATEGORIES used 3 bits
> instead of 8, then the backends could fall back on caring about some
> other thing that distinguished heap pages from each other that
> actually matters. Leading to less contention, and maybe better space
> utilization.

To me, it feels like the real issue here is that the free space map is
completely deterministic. As it's currently designed, you can reduce
the number of bits as much as you want, and it doesn't change
anything. Concurrent requests get the same answer, which is not what
we want. Just as a thought experiment, imagine that backends
preferentially choose pages whose block numbers are congruent to their
PID modulo some integer. Now you have a reason - perhaps not the best
possible reason, but *a* reason - for every single backend *not* to
pounce on the same target page.
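
A toy version of that thought experiment, just to make the shape of it
clear (the SPREAD_FACTOR constant and the fallback rule are arbitrary):

"""
#include <stdint.h>
#include <unistd.h>

/*
 * Among candidate blocks that all have enough free space, prefer ones
 * whose block number is congruent to our PID modulo some small
 * constant, so that concurrent backends fan out across different pages
 * instead of all picking the same deterministic "best" one.
 */
#define SPREAD_FACTOR 8

static int
pick_candidate_block(const uint32_t *candidates, int ncandidates)
{
    uint32_t    my_slot = (uint32_t) getpid() % SPREAD_FACTOR;

    for (int i = 0; i < ncandidates; i++)
    {
        if (candidates[i] % SPREAD_FACTOR == my_slot)
            return i;
    }

    /* No block in "our" slot: fall back to the first candidate. */
    return ncandidates > 0 ? 0 : -1;
}
"""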

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Tue, Aug 17, 2021 at 8:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Let's say that VACUUM puts a non-zero usable amount of free space in
> the FSM for a page that it marks all-visible/all-frozen at the same
> time -- this is possible today, of course. To me this seems
> contradictory, at least when there isn't much space -- which I believe
> is common. It's at least a minor waste to set the VM bit.

It seems to me that you are talking about the VM bit but making an
argument that is more applicable to the FSM. If we think that the
small amount of freespace in the page is going to result in the page
being dirtied repeatedly as it cycles through a bunch of different
tuples none of which stick around long term, then we might not want to
advertise the freespace in the FSM, or we might want the FSM to
disregard the fact that a small amount of freespace is present there.
However, if we do that, then we certainly want to also set the VM bit.
Otherwise, we've got a page that we've effectively quiesced, by
refusing to reuse the freespace, but the VM doesn't know that, so
future index-only scans and VACUUM operations are penalized.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Wed, Aug 18, 2021 at 7:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Though I think that these backends should cooperate more, some amount
> > of competition is probably necessary. If FSM_CATEGORIES used 3 bits
> > instead of 8, then the backends could fall back on caring about some
> > other thing that distinguished heap pages from each other that
> > actually matters. Leading to less contention, and maybe better space
> > utilization.
>
> To me, it feels like the real issue here is that the free space map is
> completely deterministic.

I agree. Though note the separate problem with FSM_CATEGORIES that I
mentioned to Andres: if you want to maintain the FSM value for pages
eagerly within backends, then practically speaking a far lower
FSM_CATEGORIES-like constant seems necessary.

I'm really just saying that the fine granularity seems to me to be
basically the wrong approach. In general the single most important
thing is the amount of free space available -- we agree on that.
However, the only reason to remember very small *differences* in free
space between heap pages is because you intend to do *something* with
that extra information. But caring about noise-level differences seems
inherently bad -- there is bound to be a better fall-back behavior
that reduces contention, and actually *increases* space utilization in
practice (even though theoretically coarser-grained information might
hurt utilization a little).

> As it's currently designed, you can reduce
> the number of bits as much as you want, and it doesn't change
> anything. Concurrent requests get the same answer, which is not what
> we want.

I agree that reducing FSM_CATEGORIES alone will not buy us anything.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Wed, Aug 18, 2021 at 7:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 17, 2021 at 8:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Let's say that VACUUM puts a non-zero usable amount of free space in
> > the FSM for a page that it marks all-visible/all-frozen at the same
> > time -- this is possible today, of course. To me this seems
> > contradictory, at least when there isn't much space -- which I believe
> > is common. It's at least a minor waste to set the VM bit.
>
> It seems to me that you are talking about the VM bit but making an
> argument that is more applicable to the FSM. If we think that the
> small amount of freespace in the page is going to result in the page
> being dirtied repeatedly as it cycles through a bunch of different
> tuples none of which stick around long term, then we might not want to
> advertise the freespace in the FSM, or we might want the FSM to
> disregard the fact that a small amount of freespace is present there.

I'm not quite sure if my argument applies to the VM or the FSM. It's a
bit tricky to talk about because I'm imagining a world in which the
concepts have shifted. And because the exact details of the data
structures are less clear (even if they were clear to me, the lack of
a shared vocabulary might confuse our discussion).

Maybe there isn't even a conventional FSM in this new world. Maybe
free space management works by focusing on recent events, and on
outcomes. This means that we store much less information about the
entire database, and much more information about very recent events.
In particular, I hope that free space management won't have to care
about how much free space most individual heap pages have. Perhaps
most heap pages are simply "closed", unavailable for reuse.

> However, if we do that, then we certainly want to also set the VM bit.
> Otherwise, we've got a page that we've effectively quiesced, by
> refusing to reuse the freespace, but the VM doesn't know that, so
> future index-only scans and VACUUM operations are penalized.

I agree that if we do that we'd want to also set the VM bit.

I'm imagining a world in which a "closed page" is synonymous with an
all-visible/all-frozen page. That doesn't mean that they're exactly
the same thing. For example, maybe a closed page is a page that is at
least *expected* to have its all-visible/all-frozen bits set by VACUUM
the next time it runs -- though more often they'll have been set already.

If you're looking at the system in real time with this improved
design, you'll see that pages that become closed will then also have
their VM bit set on a predictable timeline. Maybe we even set the VM
bit when some of the closed page's closed rows happen to be updated
(Freeze cutoff permitting) -- there isn't that much downside to VACUUM
betting that these "closed row updates" will be the last such updates
forever, or at least for a long time. This should at least be the case
once we add an optimized approach to freezing whole pages (pages whose
tuples start out unfrozen).

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Tue, Aug 17, 2021 at 5:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Now what's the threshold? 20 out of 100? 50? 80?
>
> I'm not going to pretend to know the answer. But I will point out that
> one DB system whose heap fill factor defaults to 90 seems to have a
> symmetric setting for the "open up page again" point -- the default
> for that is 40. Not sure that that really matters to us, but that does
> seem pretty low to me. It's very sticky indeed.

Correction: it's actually 60, not 40.

It's true that the actual default is 40, but it works the other way
around relative to Postgres (as does the related fill factor like
setting, which defaults to 90 instead of 100). And so we would think
of this other "open up closed page once again" setting as having a
default of 60. (Or perhaps we'd think of it as having a default that
is 2/3 of the closely related fill factor setting's default.)

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Wed, Aug 18, 2021 at 3:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Aug 17, 2021 at 5:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > Now what's the threshold? 20 out of 100? 50? 80?
> >
> > I'm not going to pretend to know the answer. But I will point out that
> > one DB system whose heap fill factor defaults to 90 seems to have a
> > symmetric setting for the "open up page again" point -- the default
> > for that is 40. Not sure that that really matters to us, but that does
> > seem pretty low to me. It's very sticky indeed.
>
> Correction: it's actually 60, not 40.
>
> It's true that the actual default is 40, but it works the other way
> around relative to Postgres (as does the related fill factor like
> setting, which defaults to 90 instead of 100). And so we would think
> of this other "open up closed page once again" setting as having a
> default of 60. (Or perhaps we'd think of it as having a default that
> is 2/3 of the closely related fill factor setting's default.)

I don't know whether 60 is optimal or not, but it doesn't seem crazy.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Wed, Aug 18, 2021 at 1:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Maybe there isn't even a conventional FSM in this new world. Maybe
> free space management works by focusing on recent events, and on
> outcomes. This means that we store much less information about the
> entire database, and much more information about very recent events.
> In particular, I hope that free space management won't have to care
> about how much free space most individual heap pages have. Perhaps
> most heap pages are simply "closed", unavailable for reuse.

I very much doubt that you can get away without some sort of free
space map. Even if in most cases most pages are closed to insertions,
there will be important corner cases where lots of pages are open for
insertions, like when you just deleted a ton of rows and then ran
VACUUM. And we cannot lose track of even one of those open pages or,
if the pre-8.4 state of the world is any indication, we will be super
sad. I suppose technically it doesn't need to be a map; it could be
structured some other way. But it is going to have to be able to keep
track of an arbitrary number of open pages without losing any, and a
map is one good way of doing that.

Regarding the FSM_CATEGORIES question, I think the issue is big tuples
vs. small tuples. You don't want to keep on finding pages with some
free space and then find out that they don't actually have enough for
the tuple you want to insert. Now that does become less of an issue
with this open/closed system because, at least initially, when a page
is re-opened to inserts, more than a quarter of the page will be free
(I presume) and so the details don't matter. But, as it gets closer to
filling up, it might matter again. Perhaps we don't want to waste time
looking through a bunch of pages with 1kB free when we're trying to
insert a 2kB tuple. We could close all those pages as we try them, to
prevent repeating the same process over and over again, but maybe the
2kB tuple we're now trying to insert is unusual, and most tuples are
40 bytes. Then we look silly.

One idea might be to have a CLOSED state in the FSM and then a variety
of OPEN states that are distinguished by amount of available free
space. For example, imagine something like CLOSED, OPEN (>2kb), OPEN
(1.5-2kb), OPEN (1-1.5kB), OPEN (768-1023 byte), OPEN (512-767 bytes),
OPEN (256-511 bytes), OPEN (128-255 bytes), OPEN (64-127 bytes), OPEN
(<64 bytes). I think that would give us enough information to pack
open pages tightly full before having them become closed. I don't know
if those are exactly the right boundaries, and 10 categories might be
worse than 8 or 16, but I think it's likely correct to suppose that
(a) we don't really care at all how much space is present in closed
pages, and (b) for open pages, exactitude is most important when the
amount of available space is small.
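
Just to show the shape of what I mean (the boundaries are the ones
above, purely for illustration):

"""
#include <stddef.h>

/* Sketch of a CLOSED state plus OPEN states binned by available space. */
typedef enum PageBin
{
    PAGE_CLOSED = 0,
    OPEN_LT_64,                 /* < 64 bytes free */
    OPEN_64_127,
    OPEN_128_255,
    OPEN_256_511,
    OPEN_512_767,
    OPEN_768_1023,
    OPEN_1K_1_5K,               /* 1kB - 1.5kB */
    OPEN_1_5K_2K,
    OPEN_GT_2K                  /* > 2kB */
} PageBin;

static PageBin
bin_for_open_page(size_t freebytes)
{
    /* Exactitude matters most when the remaining space is small. */
    if (freebytes < 64)
        return OPEN_LT_64;
    if (freebytes < 128)
        return OPEN_64_127;
    if (freebytes < 256)
        return OPEN_128_255;
    if (freebytes < 512)
        return OPEN_256_511;
    if (freebytes < 768)
        return OPEN_512_767;
    if (freebytes < 1024)
        return OPEN_768_1023;
    if (freebytes < 1536)
        return OPEN_1K_1_5K;
    if (freebytes < 2048)
        return OPEN_1_5K_2K;
    return OPEN_GT_2K;
}
"""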

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Fri, Aug 20, 2021 at 7:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know whether 60 is optimal or not, but it doesn't seem crazy.

Uh, I had it right the first time. Only the fill factor setting is
"inverted relative to Postgres". This other setting really does
default to 40. So it's very low.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Fri, Aug 20, 2021 at 11:06 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Aug 20, 2021 at 7:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I don't know whether 60 is optimal or not, but it doesn't seem crazy.
>
> Uh, I had it right the first time. Only the fill factor setting is
> "inverted relative to Postgres". This other setting really does
> default to 40. So it's very low.

I expect they ran more than zero tests before selecting that value, so
it's probably a decent choice in their system. However, that does seem
rather low. I would have guessed that a good value would be in the
50-80 percent range. It's hard to know, though, partly because
everything is workload dependent, and partly because you're balancing
two good things that are qualitatively different. A lower value
figures to reduce the degree of "mixing" of older and newer data
within the same pages, but it also risks permanently wasting space
that could have been put to efficient use. Imagine a table that is
written some number of times and then, all at once, you stop all
writes forever. You probably don't want to discover at that point in
time that you have a large number of pages that are not even close to
full, but you can't know when, if ever, such an event will happen for
any given table.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Fri, Aug 20, 2021 at 7:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I very much doubt that you can get away without some sort of free
> space map. Even if in most cases most pages are closed to insertions,
> there will be important corner cases where lots of pages are open for
> insertions, like when you just deleted a ton of rows and then ran
> VACUUM. And we cannot lose track of even one of those open pages or,
> if the pre-8.4 state of the world is any indication, we will be super
> sad.

I agree with all that. The point I was making is that this new FSM
design will have an on-disk size that is "a function of the workload".
The new FSM can be required to store information about every single
page, but that is the worst case. And a rather atypical case. I
imagine that we'll find that the new FSM on-disk structure stores far
less information than the current FSM in most cases, even though we're
operating within the confines of what you've said.

I think of this whole area as making heap pages a bit like B-Tree leaf
pages. TIDs are stable logical identifiers of rows (that happen to
have a physical component, a block number) in the other DB systems
that I have referenced -- the heap pages from these systems are
therefore intrinsically more like B-Tree leaf pages than those from
Postgres heapam. ISTM that that's relevant to total space utilization.
Users with sparse deletion patterns in their heap structure will get
low space utilization -- an issue that we're familiar with as a
problem for B-Tree indexing.

I don't think that having a smaller on-disk FSM should be a goal of
this project (though I suppose you could say that that aspect enables
FSM WAL-logging, which would be nice). Smaller on-disk footprints seem
like a natural consequence of this whole direction -- that's all. I
also don't think that you're going to see huge space utilization
benefits. This project is mostly aimed at reducing fragmentation, and
all of the many problems that stem from it.

> I don't know
> if those are exactly the right boundaries, and 10 categories might be
> worse than 8 or 16, but I think it's likely correct to suppose that
> (a) we don't really care at all how much space is present in closed
> pages, and (b) for open pages, exactitude is most important when the
> amount of available space is small.

I really don't have a detailed opinion on the appropriate number of
categories just yet, except that it should be maybe 16 or 20 at the
very most -- only real testing is likely to help me to refine my
thinking on that. Note that the paper "Towards Effective and Efficient
Free Space Management" recommends logarithmic intervals (called
"buckets"), with 15 total. Details are under "4 Implementing Object
Placement".

I think that it's quite possible that the final scheme will not be a
linear scale. Plus we may have to account for fill factor settings.
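
Just to sketch what a logarithmic scale might look like (the boundaries
here are invented, not taken from the paper):

"""
#include <stddef.h>

/*
 * Sketch of logarithmic bucketing: each bucket boundary doubles, so the
 * scale is finest where the amount of remaining free space is small.
 */
static int
log_bucket(size_t freebytes)
{
    size_t      boundary = 1;
    int         bucket = 0;

    while (freebytes >= boundary && bucket < 14)
    {
        boundary *= 2;
        bucket++;
    }
    return bucket;              /* 0 .. 14, i.e. 15 buckets in total */
}
"""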

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Fri, Aug 20, 2021 at 8:34 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I expect they ran more than zero tests before selecting that value, so
> it's probably a decent choice in their system. However, that does seem
> rather low. I would have guessed that a good value would be in the
> 50-80 percent range.

They don't have to deal with non-HOT updates, which effectively move
the row to another block. Actually, that's not quite true -- they do
have a concept called row migration. But it's the strategy of last
resort, and is presumably very rare -- much rarer than a non-HOT
update that moves a row.

My point is that sparse deletion patterns would be even easier to run
into in a version of Postgres with open and closed heap pages -- because
it's not only deletes that you have to worry about.

> It's hard to know, though, partly because
> everything is workload dependent, and partly because you're balancing
> two good things that are qualitatively different. A lower value
> figures to reduce the degree of "mixing" of older and newer data
> within the same pages, but it also risks permanently wasting space
> that could have been put to efficient use.

All true -- to me this is all about adapting to workloads at a fine granularity.

If we have closed pages, then non-HOT updates that cannot keep the row
on the same heap page now actually "relieve the pressure", which is
not currently true. If we started out with a heap page that had 100
tuples (say with fill factor 90), and then we find that we cannot keep
the entire page together...then maybe we'll have more luck with 99, or
just 90, or even less. By firmly sticking with our original goal, then
we have some chance to learn what really will be stable for a given
heap page. There is nothing wrong with learning that lesson through
trial and error, based on negative information from the workload.

Right now we go ahead and reuse the space at the next opportunity (we
particularly favor pages like this, even). So we never learn from our
mistakes.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Fri, Aug 20, 2021 at 11:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Aug 20, 2021 at 7:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I very much doubt that you can get away without some sort of free
> > space map. Even if in most cases most pages are closed to insertions,
> > there will be important corner cases where lots of pages are open for
> > insertions, like when you just deleted a ton of rows and then ran
> > VACUUM. And we cannot lose track of even one of those open pages or,
> > if the pre-8.4 state of the world is any indication, we will be super
> > sad.
>
> I agree with all that. The point I was making is that this new FSM
> design will have an on-disk size that is "a function of the workload".

That could be the right decision, but nothing said up to this point
really seems to suggest it. The open/closed distinction and the
changes in how to bin available space could all be done with the
present model.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Fri, Aug 20, 2021 at 9:03 AM Robert Haas <robertmhaas@gmail.com> wrote:
> That could be the right decision, but nothing said up to this point
> really seems to suggest it. The open/closed distinction and the
> changes in how to bin available space could all be done with the
> present model.

I guess that's true. Isn't this just semantics, though?

The new FSM implementation certainly cannot forget about pages that it
deems open, or fail to have information about pages that should be
considered open -- that would be absolutely unacceptable. And the
implementation cannot assume that it won't have to maintain
information about every single heap page. All I'm saying is that in
practice this will be very rare. Making the FSM structure smaller
isn't really the goal, though.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Fri, Aug 20, 2021 at 12:17 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Aug 20, 2021 at 9:03 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > That could be the right decision, but nothing said up to this point
> > really seems to suggest it. The open/closed distinction and the
> > changes in how to bin available space could all be done with the
> > present model.
>
> I guess that's true. Isn't this just semantics, though?

Not entirely. On one level, as long as we end up with an FSM
implementation that does good things, who cares how it works
internally? Well, the answer is, hackers care, and we're hackers, so
it doesn't seem out of bounds to talk about thoughts and ideas around
that. My own opinion is that the multi-level implementation of the
current FSM is confusing and doesn't seem to do what we want, but the
basic idea of a direct-mapped data structure seems good to me. Life is
a lot simpler if you can update the status of your own page without
having to do any complex calculation to figure out where that status
is stored, let alone having to shuffle things around to make room to
insert your status. Now if you or someone else comes up with an
absolutely amazing new design that isn't direct-mapped and it's
awesome, I'm not going to sit here and hold my breath; I like awesome
as well as the next person. But if you or someone else said to me
"hey, Robert, I'm thinking of rewriting the FSM, what tips do you have
for maximizing the chances of success?" I'd say "well, you should try
to make the replacement data structure as simple as it can be while
still having the other properties that you care about ... and in
particular I would suggest something that uses a fixed amount of state
per heap page, because that seems much simpler to me." But nobody's
got to agree with that, and it doesn't have to be right. It's just
what I think.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Fri, Aug 20, 2021 at 10:15 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I guess that's true. Isn't this just semantics, though?
>
> Not entirely. On one level, as long as we end up with an FSM
> implementation that does good things, who cares how it works
> internally?

> I'd say "well, you should try
> to make the replacement data structure as simple as it can be while
> still having the other properties that you care about ... and in
> particular I would suggest something that uses a fixed amount of state
> per heap page, because that seems much simpler to me." But nobody's
> got to agree with that, and it doesn't have to be right. It's just
> what I think.

But not storing information about a heap page implicitly means that
the page is closed. And so we effectively remember the state of every
single heap page with the FSM design I am working on. Not storing any
information in the data structure is just an optimization. In any case
we're obligated to have accurate information about all individual heap
pages.

Doing it that way seems pretty natural to me, even though the overhead
of maintaining the FSM data structure isn't a priority for me -- not
at all. It might actually be easier to have that optimization than to
not have it. The starting point for the new FSM data structure is
lists of blocks that are usable, that have free space. Whereas the
starting point for the existing FSM structure is every single extant
heap page in the table -- it's strictly obligated to have explicit
information about every single page (otherwise we get nasty smgr
errors). It's either a small difference or a big difference, depending
on how you think about it.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Fri, Aug 20, 2021 at 10:32 AM Peter Geoghegan <pg@bowt.ie> wrote:
> But not storing information about a heap page implicitly means that
> the page is closed. And so we effectively remember the state of every
> single heap page with the FSM design I am working on. Not storing any
> information in the data structure is just an optimization.

> Doing it that way seems pretty natural to me, even though the overhead
> of maintaining the FSM data structure isn't a priority for me -- not
> at all. It might actually be easier to have that optimization than to
> not have it. The starting point for the new FSM data structure is
> lists of blocks that are usable, that have free space.

There is one problem with what I've said here, which might be behind
your concern. That problem is: if a page being closed is treated as an
implicit thing, just from its absence in this new FSM structure, then
we need to tighten things up elsewhere. In particular, this code at
the end of RelationGetBufferForTuple() is now a problem:

"""
/*
 * Remember the new page as our target for future insertions.
 *
 * XXX should we enter the new page into the free space map immediately,
 * or just keep it for this backend's exclusive use in the short run
 * (until VACUUM sees it)? Seems to depend on whether you expect the
 * current backend to make more insertions or not, which is probably a
 * good bet most of the time.  So for now, don't add it to FSM yet.
 */
RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
"""

Note that the relation bulk extension mechanism that you committed (in
commit 719c84c1be) will always allocate one more block, just for the
leader backend (the backend that bulk extends for other backends) --
we don't tell the FSM about this same block, presumably because to do
so would risk live lock inside the loop in the same function. We just
assume that it's okay that the leader backend remembers this in its
local relcache "target block".

Right now it isn't so bad that we're kind of sloppy here, at least
relative to everything else. Although the leader backend may well
"leak" the page before it has the opportunity to write very much to it
(e.g., client disconnects), in the long run we should get that space
back. At least we know that VACUUM will ultimately be able to find the
space again, and put it in the FSM for reuse. Since there is currently
no question about "closed vs open" pages, there is no question about
VACUUM/the FSM becoming confused about which state applies to which
page in this scenario.

In short: yeah, this "closed vs open" page business more or less
necessitates tightening things up here. Though this new requirement
seems to have always been a good idea. Just because we can lean on
VACUUM like this (an option that other DB systems don't have) doesn't
mean that we should do so.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Fri, Aug 20, 2021 at 3:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
> In short: yeah, this "closed vs open" page business more or less
> necessitates tightening things up here. Though this new requirement
> seems to have always been a good idea. Just because we can lean on
> VACUUM like this (an option that other DB systems don't have) doesn't
> mean that we should do so.

That's all true, but it's not what I was talking about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Fri, Aug 20, 2021 at 12:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 20, 2021 at 3:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > In short: yeah, this "closed vs open" page business more or less
> > necessitates tightening things up here. Though this new requirement
> > seems to have always been a good idea. Just because we can lean on
> > VACUUM like this (an option that other DB systems don't have) doesn't
> > mean that we should do so.
>
> That's all true, but it's not what I was talking about.

Then I am not sure that I have understood you fully. I would like to
make sure that I have the nuance of it. As long as we have the idea of
open vs closed pages then we cannot assume that we know that status
from the page alone. Unless we're willing to go update the pages
again, and possibly dirty them again, which seems unappealing.

My concern here is really the data structure and its overall
complexity. If there has to be an explicit state for every page on the
FSM, then this new FSM needs to merge freelists that have been emptied
into that structure, and needs to handle the question of which
structure (map or free lists) is currently authoritative for a given
heap page. That's a lot of complexity compared to just forgetting the
free lists that are known to be fully closed, which will probably be
very common. That can work a little like discarding old UNDO -- it can
be simple and cheap *because* it's largely logical and implicit at the
level of the physical data structures.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Fri, Aug 20, 2021 at 3:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My concern here is really the data structure and its overall
> complexity. If there has to be an explicit state for every page on the
> FSM, then this new FSM needs to merge freelists that have been emptied
> into that structure, and needs to handle the question of which
> structure (map or free lists) is currently authoritative for a given
> heap page. That's a lot of complexity compared to just forgetting the
> free lists that are known to be fully closed, which will probably be
> very common. That can work a little like discarding old UNDO -- it can
> be simple and cheap *because* it's largely logical and implicit at the
> level of the physical data structures.

I don't think I understand the data structure that you have in mind
well enough to comment intelligently.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Fri, Aug 20, 2021 at 1:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 20, 2021 at 3:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > My concern here is really the data structure and its overall
> > complexity.

> I don't think I understand the data structure that you have in mind
> well enough to comment intelligently.

Right now my prototype has a centralized table in shared memory, with
a hash table. One entry per relation, generally multiple freelists per
relation. And with per-freelist metadata such as owner and original
leader backend XID values. Plus of course the lists of free blocks
themselves. The prototype already clearly fixes the worst problems
with BenchmarkSQL, but that's only one of my goals. That's just the
starting point.
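
To make the shape of this concrete, here is a minimal, self-contained C
sketch of that kind of structure. All names and sizes are invented for
illustration -- this is not the prototype's actual layout, just "one
hash entry per relation, several freelists per relation, ownership
metadata per freelist, plus the block lists themselves":

/*
 * Illustration only: one shared hash entry per relation, several
 * freelists ("leases") per entry, each lease carrying ownership
 * metadata plus its list of free blocks.
 */
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint32_t TransactionId;
typedef uint32_t Oid;

#define MAX_LEASES_PER_REL    8
#define MAX_BLOCKS_PER_LEASE  32

typedef struct FreeSpaceLease
{
    int           owner_proc_no;   /* backend holding the lease, or -1 */
    TransactionId leader_xid;      /* original leader transaction's XID */
    int           nblocks;         /* blocks remaining in the lease */
    BlockNumber   blocks[MAX_BLOCKS_PER_LEASE];
} FreeSpaceLease;

typedef struct RelFreeSpaceEntry
{
    Oid            relid;          /* hash key: the relation */
    int            nleases;
    FreeSpaceLease leases[MAX_LEASES_PER_REL];
} RelFreeSpaceEntry;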

I appreciate your input on this. And not just on the implementation
details -- the actual requirements themselves are still in flux. This
isn't so much a project to replace the FSM as it is a project that
adds a new rich abstraction layer that goes between access methods and
smgr.c -- free space management is only one of the responsibilities.

Access methods like heapam can talk to the new abstraction layer,
sometimes exploiting semantics from the logical database -- owning
transaction IDs for a given lease, things like that. The new
abstraction may thereby help us when working on other problems,
including problems in areas that might seem totally unrelated. That's
the big idea. Here are some (admittedly quite speculative) examples of
places that the infrastructure could help with:

* Parallel INSERT/COPY.

I imagine parallel DML is hard because of coordination problems. With
the infrastructure I have in mind, a parallel DML implementation could
request a large free space lease, and distribute blocks from that
lease among parallel workers during execution, on demand. This
naturally avoids unnecessary buffer lock contention among concurrent
writes from the same transaction (but different workers).
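
As a sketch only (reusing the invented FreeSpaceLease structure from
the earlier sketch, and waving synchronization away entirely), handing
out blocks from a lease on demand might look like this:

/*
 * Illustration only: a parallel leader requests one large lease up
 * front; workers then pull blocks from it on demand.  A real version
 * would need an LWLock or atomics here.
 */
#define SKETCH_INVALID_BLOCK ((BlockNumber) 0xFFFFFFFF)

static BlockNumber
lease_next_block(FreeSpaceLease *lease)
{
    if (lease->nblocks == 0)
        return SKETCH_INVALID_BLOCK;        /* exhausted; ask the leader for more */
    return lease->blocks[--lease->nblocks]; /* hand out blocks from the tail */
}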

But that's not all. I'm also thinking of things like LWLock/buffer
lock deadlock avoidance -- issues of fundamental correctness. That
stuff becomes easier, too. Individual parallel workers from a parallel
INSERT could safely assume that the blocks that the parallel leader
tells them about really is only going to be accessed by them, for as
long as the owning transaction is around (barring read-only access by
VACUUM, things like that). There is simply no question of any other
backend/parallel worker thinking that it has the right to use the same
block before the transaction commits.

The transaction-scoped ownership semantics must be truly robust for
this to work at all, but that's a good idea anyway.

* Truncate tables without a disruptive relation-level lock.

AFAICT it isn't actually logically necessary to get a heavyweight
relation extension lock to truncate a relation during VACUUM. The only
reason we need that right now is that it's really hard to nail down
which backends might still hold a usable BlockNumber that they expect
not to point past the end of the physical relfilenode (or, much worse,
to point at new and logically unrelated data). Perhaps we can solve the
underlying problem using something like Lanin & Shasha's "drain
technique", which is used inside nbtree for page deletion today.

The general idea is to defer the physical unlink/recycling until we
know for sure that any backends that might have had unsafe references
have gone away. This is a very broad technique.
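
As a sketch of the general deferral pattern only (the barrier counter
and the helpers are invented, and nothing here corresponds to existing
PostgreSQL code): stamp the truncated-away range with the current
"drain barrier" value, and only recycle it once every backend is known
to have moved past that value.

/*
 * Illustration only, in the spirit of Lanin & Shasha's drain technique:
 * physical recycling of a truncated-away block range is deferred until
 * no backend can still hold a stale BlockNumber into it.
 */
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t DrainBarrier;

extern DrainBarrier oldest_barrier_held_by_any_backend(void);  /* placeholder */

typedef struct PendingRecycle
{
    DrainBarrier stamped_at;    /* barrier value when the range was truncated away */
    /* ... identity of the truncated-away block range ... */
} PendingRecycle;

static bool
safe_to_recycle(const PendingRecycle *p)
{
    /* Every backend has passed the stamp: no stale references remain. */
    return oldest_barrier_held_by_any_backend() > p->stamped_at;
}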

* Truncate-away empty heap table segments that happen to come before
other segments that still have useful data.

* Eager cleanup of aborted transactions by VACUUM (or some variation
of VACUUM that just handles aborts).

Again, we're using information about the logical database to solve
problems with the physical database -- that information is what we seem
to currently be missing in all these areas. That's really the central
idea behind the new infrastructure.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Mon, Aug 23, 2021 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Right now my prototype has a centralized table in shared memory, with
> a hash table. One entry per relation, generally multiple freelists per
> relation. And with per-freelist metadata such as owner and original
> leader backend XID values. Plus of course the lists of free blocks
> themselves. The prototype already clearly fixes the worst problems
> with BenchmarkSQL, but that's only one of my goals. That's just the
> starting point.
>
> I appreciate your input on this. And not just on the implementation
> details -- the actual requirements themselves are still in flux. This
> isn't so much a project to replace the FSM as it is a project that
> adds a new rich abstraction layer that goes between access methods and
> smgr.c -- free space management is only one of the responsibilities.

Makes sense. I think one of the big implementation challenges here is
coping with the scenario where there's not enough shared memory
available ... or else somehow making that impossible without reserving
an unreasonable amount of shared memory. If you allowed space for
every buffer to belong to a different relation and have the maximum
number of leases and whatever, you'd probably have no possibility of
OOM, but you'd probably be pre-reserving too much memory. I also think
there are some implementation challenges around locking. You probably
need some, because the data structure is shared, but because it's
complex, it's not easy to create locking that allows for good
concurrency. Or so I think.

Andres has been working -- I think for years now -- on replacing the
buffer mapping table with a radix tree of some kind. That strikes me
as very similar to what you're doing here. The per-relation data can
then include not only the kind of stuff you're talking about but very
fundamental things like how long it is and where its buffers are in
the buffer pool. Hopefully we don't end up with dueling patches.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Wed, Aug 25, 2021 at 10:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Makes sense.

I'm glad that the big picture stuff makes sense to you.

> I think one of the big implementation challenges here is
> coping with the scenario where there's not enough shared memory
> available ... or else somehow making that impossible without reserving
> an unreasonable amount of shared memory.

Yes, it'll definitely be necessary to nail that down.

> If you allowed space for
> every buffer to belong to a different relation and have the maximum
> number of leases and whatever, you'd probably have no possibility of
> OOM, but you'd probably be pre-reserving too much memory.

I hope that we can control the shared memory space overhead by making
it a function of max_connections, plus some configurable number of
relations that get modified within a single transaction. This approach
must behave in the same way when the number of tables that each
transaction actually modifies is high -- perhaps a transaction that
does this then pays a penalty in WAL logging within the FSM. I think
that that can be made manageable, especially if we can pretty much
impose the cost directly on those transactions that need to modify
lots of relations all at once. (If we can reuse the shared memory over
time it'll help too.)
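
Just to illustrate the arithmetic (the parameters and the entry type
are stand-ins from the earlier sketch, not real GUCs or structures),
the reservation could be a simple product along these lines:

/*
 * Hypothetical sizing only: reserve lease-table entries for every
 * backend times a small, configurable number of concurrently-modified
 * relations.  A real version would add hash table and list overhead.
 */
#include <stddef.h>

static size_t
FreeSpaceLeaseShmemSize(int max_connections, int rels_modified_per_xact)
{
    size_t nentries = (size_t) max_connections * (size_t) rels_modified_per_xact;

    return nentries * sizeof(RelFreeSpaceEntry);
}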

> I also think
> there are some implementation challenges around locking.

That seems likely.

> You probably
> need some, because the data structure is shared, but because it's
> complex, it's not easy to create locking that allows for good
> concurrency. Or so I think.

My hope is that this design more than makes up for it by relieving
contention in other areas. Like buffer lock contention, or relation
extension lock contention.

> Andres has been working -- I think for years now -- on replacing the
> buffer mapping table with a radix tree of some kind. That strikes me
> as very similar to what you're doing here. The per-relation data can
> then include not only the kind of stuff you're talking about but very
> fundamental things like how long it is and where its buffers are in
> the buffer pool. Hopefully we don't end up with dueling patches.

I agree that there is definitely some overlap. I see no risk of a real
conflict, though. I have mostly been approaching this project as an
effort to fix the locality problems, largely by looking for fixes to
the BenchmarkSQL workload's problems. I have to admit that the big
picture stuff about exploiting transactional semantics with free space
management is still pretty aspirational. The resource management parts
of my prototype patch are by far the kludgiest parts.

I hope that I can benefit from whatever work Andres has already done
on this, particularly when it comes to managing per-relation metadata
in shared memory.

--
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This is actually quite simple -- the chaos that follows is where it
> gets truly complicated (and not necessarily worth getting into just
> yet). The update of a given order and its order lines entries takes
> place hours later, within a delivery transaction.

Update on this: aborted transactions play an important role in the
chaos that we see with BenchmarkSQL/TPC-C.

This is surprising, in one way. Only 1% of all transactions are
aborted (per the TPC-C spec). But in another way it's not surprising.
It's totally consistent with what I've been saying about open and
closed pages. When a page that's initially allocated and opened by a
backend becomes closed following some inserts, the page should be left
in a more or less pristine state -- owning backends that open and
close pages need to be "good citizens".

The definition of "pristine" must also include "no already-aborted
tuples" -- aborted xact tuples are a problem that we can and should
nip in the bud. This is really an extension of what I've been saying
about not disrupting naturally occurring temporal and spatial locality.
We're far better off if the space reused by freeing aborted heap
tuples is reused by the very next transaction running in the very same
backend. Plus it's not that hard; we can know for sure that this space
is immediately available, which is not generally true of pruning. And
so this free space from aborted xact tuples is quite special.

Even with the master branch the problems with the FSM are much less
noticeable once you configure BenchmarkSQL to not abort any
transactions [1]. You need to consider the whole picture to understand
how the issue of aborted transactions has such an outsized negative
impact on performance and space efficiency.

It's a whole *chain* of events:

* If just 1% of all transactions abort, that's enough to leave an
average of about 1 aborted tuple on every "order" table page (which
naturally has small heap tuples [2]), and about 10 on a minority of
the really big "new order" table's pages. Maybe 1 in 10 or 12 "order
line" table heap pages are affected in this way.

* The overall impact is that small (but not tiny) pockets of free
space are left randomly strewn all over the place. Not just LP_DEAD
line pointer stubs -- whole entire aborted heap tuples.

* In theory we could address this problem using opportunistic pruning.
But in practice that simply cannot happen today, because we don't
touch the inserted rows until literally hours later.

The same rows will eventually have updates and selects, but that won't
help -- too little, too late. That's just how opportunistic pruning
works: currently you need some kind of scan node for opportunistic
pruning to kick in, so continually inserting transactions (from new
orders) are currently fundamentally unable to do any opportunistic
pruning. Much less opportunistic pruning that kicks in at exactly the
right time.

* When an autovacuum worker runs against these tables, it will of
course notice this diffuse free space from aborted tuples, and put it
in the FSM.

That free space will be reused, but by totally unrelated logical rows.

What we really seem to need here is some rudimentary form of "eager
physical rollback"; the really important thing seems to be to not
allow the problem to happen in the first place. I have a basic
prototype implementation, built on top of the larger prototype patch.
It opportunistically prunes away aborted heap tuples on target/open
pages, bypassing the usual pd_prune_xid check we have in
heap_page_prune_opt() -- that check doesn't make sense with the new
form of targeted, abort-oriented pruning. The new pruning mechanism
makes a
really big difference on its own, without even involving the FSM. The
best cure for all kinds of bloat is prevention, which this comes
pretty close to.
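
To spell out what bypassing the usual check amounts to, here is a very
rough sketch (placeholder types and a placeholder prune function --
prune_page_for_xid() is not a real PostgreSQL function, and this is not
the actual patch):

/*
 * Sketch only: after an abort, the owning backend immediately revisits
 * the open pages it was just inserting into and prunes away its own
 * aborted tuples, with no "is pruning worthwhile?" precheck -- it
 * already knows the answer.
 */
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint32_t TransactionId;

extern void prune_page_for_xid(BlockNumber blkno, TransactionId aborted_xid); /* placeholder */

typedef struct OwnedPages
{
    int         npages;
    BlockNumber pages[16];      /* pages this backend opened for its inserts */
} OwnedPages;

static void
eager_rollback_prune(const OwnedPages *owned, TransactionId aborted_xid)
{
    for (int i = 0; i < owned->npages; i++)
        prune_page_for_xid(owned->pages[i], aborted_xid);
}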

I think that this general idea can be pushed further. I've talked
about aborts, but what about the commit path? I don't see why we can't
make inserting backends responsible for setting hint bits on recently
inserted rows -- the cost of putting off that work can only go up, no
matter what (unless maybe the user drops the table). Again, the
missing piece seems to be a general sense of ownership of heap pages
by transactions and/or backends. In order for a backend to be able to
clean up its own mess after itself while it's still cheap, it has to
understand what that actually means.

Don't forget that the TPC-C workload doesn't have that many updates
(at least not for the tables I'm focussed on here). Nothing that we
tend to think of as an adversarial case for Postgres, except perhaps
the updates against the "order" table, which are always non-HOT
updates due to an issue with the indexes. That's the extent of it,
though -- every other table easily has 99%+ of all updates use HOT
(albeit with some heap fill factor tuning). We should *expect* long
term stability here, based on the particulars of the workload; we only
need to update every "order" row and every "order lines" row exactly
once. After that they can just age out of shared_buffers forever
(actually they might be read again, though never modified again, and
reads were never the problem). The big "order line" table is by far
the biggest problem, even though it manages to use HOT for practically
all updates.

[1] https://github.com/wieck/benchmarksql/commit/66b8db073545026dc76ef513d2b0e318d2f3d5a2
[2] https://apps.dtic.mil/dtic/tr/fulltext/u2/a264793.pdf -- "Table 1:
Summary of Logical Database"
--
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Hannu Krosing
Date:
When I have been thinking of this type of problem it seems that the
latest -- and correct :) --  place which should do all kinds of
cleanup like removing aborted tuples, freezing committed tuples and
setting any needed hint bits would be background writer or CHECKPOINT.

This would be more PostgreSQL-like, as it moves any work not
immediately needed from the critical path, as an extension of how MVCC
for PostgreSQL works in general.

Another option could be one more background cleaner with "needs
cleanup" bitmap which is updated by "real work" backends to let the
cleaner know what needs to be done.

But doing it as part of checkpoint probably ends up with less WAL
writes in the end.

That said, CHECKPOINT can efficiently clean up only heap pages, unless
we do some extra work to ensure buffer cache eviction ordering so that
the heap is (almost) always cleaned first, and thus has the option to
also clean the index pointers which point to it.

There could be a possibility to do a small amount of cleanup -- enough
for TPC-C-like workloads, but not larger ones -- while waiting for the
next command to arrive from the client over the network. This of
course assumes that we will not improve our feeder mechanism to have
back-to-back incoming commands, which can already be done today, but
which I have seen seldom used.

Cheers,
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Mon, Sep 6, 2021 at 4:33 PM Hannu Krosing <hannuk@google.com> wrote:
> When I have been thinking of this type of problem it seems that the
> latest -- and correct :) --  place which should do all kinds of
> cleanup like removing aborted tuples, freezing committed tuples and
> setting any needed hint bits would be background writer or CHECKPOINT.
>
> This would be more PostgreSQL-like, as it moves any work not
> immediately needed from the critical path, as an extension of how MVCC
> for PostgreSQL works in general.

I think it depends. There is no need to do work in the background
here, with TPC-C. With my patch series each backend can know that it
just had an aborted transaction that affected a page that it more or
less still owns. And has very close at hand, for further inserts. It's
very easy to piggy-back the work once you have that sense of ownership
of newly allocated heap pages by individual backends/transactions.

> This would be more PostgreSQL-like, as it moves any work not
> immediately needed from the critical path, as an extension of how MVCC
> for PostgreSQL works in general.

I think that it also makes sense to have what I've called "eager
physical rollback" that runs in the background, as you suggest.

I'm thinking of a specialized form of VACUUM that targets a specific
aborted transaction's known-dirtied pages. That's my long term goal,
actually. Originally I wanted to do this as a way of getting rid of
SLRUs and tuple freezing, by representing that all heap pages must
only have committed tuples implicitly. That seemed like a good enough
reason to split VACUUM into specialized "eager physical rollback
following abort" and "garbage collection" variants.

The insight that making abort-related cleanup special will help free
space management is totally new to me -- it emerged from working
directly on this benchmark. But it nicely complements some of my
existing ideas about improving VACUUM.

> But doing it as part of checkpoint probably ends up with less WAL
> writes in the end.

I don't think that checkpoints are special in any way. They're very
important in determining the total number of FPIs we'll generate, and
so have huge importance today. But that seems accidental to me.

> There could be a possibility to do a small amount of cleanup -- enough
> for TPC-C-like workloads, but not larger ones -- while waiting for the
> next command to arrive from the client over the network. This of
> course assumes that we will not improve our feeder mechanism to have
> back-to-back incoming commands, which can already be done today, but
> which I have seen seldom used.

That's what I meant, really. Doing the work of cleaning up a heap page
that a transaction inserts into (say pruning away aborted tuples or
setting hint bits) should ideally happen right after commit or abort
-- at least for OLTP like workloads, which are the common case for
Postgres. This cleanup doesn't have to be done by exactly the same
transactions (and can't be in most interesting cases). It should be
quite possible for the work to be done by approximately the same
transaction, though -- the current transaction cleans up inserts made
by the previous (now committed/aborted) transaction in the same
backend (for the same table).

The work of setting hint bits and pruning-away aborted heap tuples has
to be treated as a logical part of the cost of inserting heap tuples
-- backends pay this cost directly. At least with workloads where
transactions naturally only insert a handful of rows each in almost
all cases -- very much the common case.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Thomas Munro
Date:
On Thu, Aug 26, 2021 at 5:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Andres has been working -- I think for years now -- on replacing the
> buffer mapping table with a radix tree of some kind. That strikes me
> as very similar to what you're doing here. The per-relation data can
> then include not only the kind of stuff you're talking about but very
> fundamental things like how long it is and where its buffers are in
> the buffer pool. Hopefully we don't end up with dueling patches.

FTR I have a patch in development that adds a demand-paged (though
Anastasia recently suggested reconsidering that) per-SMGR relation
shmem object pool that initially tracks precisely "how long it is",
but I hope it would provide a hook to hang many other things on in
future that need coordination at the relation level, such as
synchronized scan position, and I hope, radix-based buffer mappings.
Could be relevant?   https://commitfest.postgresql.org/34/2933/ (I'll
rebase it soon).



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Mon, Sep 6, 2021 at 7:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> FTR I have a patch in development that adds a demand-paged (though
> Anastasia recently suggested reconsidering that) per-SMGR relation
> shmem object pool that initially tracks precisely "how long it is",
> but I hope it would provide a hook to hang many other things on in
> future that need coordination at the relation level, such as
> synchronized scan position, and I hope, radix-based buffer mappings.

Cool!

> Could be relevant?   https://commitfest.postgresql.org/34/2933/ (I'll
> rebase it soon).

I think that it is relevant. You're coming at the same thing from
roughly the opposite direction, at least in my mind. But that seems
like it might be a very good thing, given the specifics. I think that
we can meet each other in the middle, without there being any
particular risk of overlapping work or conflicting requirements. The
resource management aspects of my own prototype are the least worked
out and shakiest parts overall, so this seems likely. You've probably
already solved numerous problems for me.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Hannu Krosing
Date:
On Tue, Sep 7, 2021 at 2:29 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Sep 6, 2021 at 4:33 PM Hannu Krosing <hannuk@google.com> wrote:
> > When I have been thinking of this type of problem it seems that the
> > latest -- and correct :) --  place which should do all kinds of
> > cleanup like removing aborted tuples, freezing committed tuples and
> > setting any needed hint bits would be background writer or CHECKPOINT.
> >
> > This would be more PostgreSQL-like, as it moves any work not
> > immediately needed from the critical path, as an extension of how MVCC
> > for PostgreSQL works in general.
>
> I think it depends. There is no need to do work in the background
> here, with TPC-C. With my patch series each backend can know that it
> just had an aborted transaction that affected a page that it more or
> less still owns. And has very close at hand, for further inserts. It's
> very easy to piggy-back the work once you have that sense of ownership
> of newly allocated heap pages by individual backends/transactions.

Are you speaking of just heap pages here, or also index pages?

It seems indeed easy for the heap, but index pages can be mixed up by
other parallel work, especially with things like Serial Primary Keys.

Or are you expecting these to be kept in good-enough shape by your
earlier index manager work?

> > This would be more PostgreSQL-like, as it moves any work not
> > immediately needed from the critical path, as an extension of how MVCC
> > for PostgreSQL works in general.
>
> I think that it also makes sense to have what I've called "eager
> physical rollback" that runs in the background, as you suggest.
>
> I'm thinking of a specialized form of VACUUM that targets a specific
> aborted transaction's known-dirtied pages. That's my long term goal,
> actually. Originally I wanted to do this as a way of getting rid of
> SLRUs and tuple freezing, by representing that all heap pages must
> only have committed tuples implicitly. That seemed like a good enough
> reason to split VACUUM into specialized "eager physical rollback
> following abort" and "garbage collection" variants.
>
> The insight that making abort-related cleanup special will help free
> space management is totally new to me -- it emerged from working
> directly on this benchmark. But it nicely complements some of my
> existing ideas about improving VACUUM.

A minimal useful patch emerging from that understanding could be
something which just adds hysteresis to FSM management. (TBH, I
actually kind of expected some hysteresis to be there already, as it
is in my mental model of "how things should be done" for managing
almost any resource :) )

Adding hysteresis to FSM management can hopefully be done independently
of all the other stuff, and also seems to be something that is
unobtrusive and non-controversial enough to fit into the current release,
and possibly even be back-ported.

> > But doing it as part of checkpoint probably ends up with less WAL
> > writes in the end.
>
> I don't think that checkpoints are special in any way. They're very
> important in determining the total number of FPIs we'll generate, and
> so have huge importance today. But that seems accidental to me.

I did not mean CHECKPOINT as a command, but more the concept of
writing back / un-dirtying the page. In this sense it *is* special
because it is the last point in time where you are guaranteed to have
the page available in buffercache and thus cheap to access for
modifications plus you will avoid a second full-page writeback because
of cleanup. Also you do not want to postpone the cleanup to actual
page eviction, as that is usually in the critical path for some user
query or command.

Of course this becomes most important for workloads where the active
working set is larger than what fits in memory, which is not a typical
case for OLTP any more. But at least freezing the page before
write-out could have a very big impact on the need to freeze "old"
pages available only on disk and thus would be a cheap way to improve
the problems around running out of transaction ids.

> > There could be a possibility to do a small amount of cleanup -- enough
> > for TPC-C-like workloads, but not larger ones -- while waiting for the
> > next command to arrive from the client over the network. This of
> > course assumes that we will not improve our feeder mechanism to have
> > back-to-back incoming commands, which can already be done today, but
> > which I have seen seldom used.
>
> That's what I meant, really. Doing the work of cleaning up a heap page
> that a transaction inserts into (say pruning away aborted tuples or
> setting hint bits) should ideally happen right after commit or abort
> -- at least for OLTP like workloads, which are the common case for
> Postgres. This cleanup doesn't have to be done by exactly the same
> transactions (and can't be in most interesting cases). It should be
> quite possible for the work to be done by approximately the same
> transaction, though -- the current transaction cleans up inserts made
> by the previous (now committed/aborted) transaction in the same
> backend (for the same table).

Again, do I assume correctly that you are here mainly targeting heap
and not indexes?

> The work of setting hint bits and pruning-away aborted heap tuples has
> to be treated as a logical part of the cost of inserting heap tuples
> -- backends pay this cost directly. At least with workloads where
> transactions naturally only insert a handful of rows each in almost
> all cases -- very much the common case.

I know that in general we are very reluctant to add threads to
PostgreSQL, but this looks like a very valid case for having a
"microthread" running on the same core as the DML as it will share all
the CPU caches and can thus be really cheap without being actually in
the critical path.

 Cheers,
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Mon, Sep 6, 2021 at 8:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Sep 6, 2021 at 4:33 PM Hannu Krosing <hannuk@google.com> wrote:
> > When I have been thinking of this type of problem it seems that the
> > latest -- and correct :) --  place which should do all kinds of
> > cleanup like removing aborted tuples, freezing committed tuples and
> > setting any needed hint bits would be background writer or CHECKPOINT.
> >
> > This would be more PostgreSQL-like, as it moves any work not
> > immediately needed from the critical path, as an extension of how MVCC
> > for PostgreSQL works in general.
>
> I think it depends. There is no need to do work in the background
> here, with TPC-C. With my patch series each backend can know that it
> just had an aborted transaction that affected a page that it more or
> less still owns. And has very close at hand, for further inserts. It's
> very easy to piggy-back the work once you have that sense of ownership
> of newly allocated heap pages by individual backends/transactions.

Doing work in the background has some advantages, though. In
particular, it has the possibly-large advantage of not slowing down
foreground work.

For me the key insight here is that HOT-pruning a heap page is a lot
cheaper if you do it before you write the page. Once you've written
the page, the eventual HOT-prune is going to need to dirty it, which
will cause it to be written again. If you prune before writing it the
first time, that cost is avoided. I'm not sure that it really matters
whether the space consumed by aborted tuples gets reused by "the very
next transaction" or, say, 10 transactions after that, or even 1000
transactions after that. If you wait for a million transactions,
you've quite possibly lost enough temporality to matter, but at 10 or
1000 that's much less likely. The exact threshold is fuzzy: every
moment you wait makes it less likely that you have sufficient
locality, but you can always find a workload where even a very long
wait is acceptable, and another one where even a tiny delay is
catastrophic, and it's hard to say what the "real world" looks like.

On the other hand, there's nothing fuzzy about the expense incurred by
writing the page before it's HOT-pruned. That is essentially certain
to incur an extra page write, except in the corner case where the
relation gets dropped or truncated before then. So I think you might
find that if you found a way to ensure that HOT-pruning -- and entry
into the FSM -- always happens for every heap page just before it's
written, if it hasn't already happened sooner and might be needed, you
might end up in a pretty good spot. It wouldn't even be ignoring
temporal locality, since at minimum dirty pages are written once per
checkpoint cycle.
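
A sketch of the ordering being described here (the hook point and the
helpers are invented, not an existing PostgreSQL code path):

/*
 * Sketch only: prune (and record free space in the FSM) while the page
 * is already dirty and about to be written, so that the prune never
 * forces a second write of the same page later on.
 */
#include <stdbool.h>

typedef int Buffer;             /* stand-in for PostgreSQL's Buffer */

extern bool page_might_benefit_from_prune(Buffer buf);  /* placeholder */
extern void prune_and_record_free_space(Buffer buf);    /* placeholder */

static void
before_page_write(Buffer buf)
{
    if (page_might_benefit_from_prune(buf))
        prune_and_record_free_space(buf);   /* piggy-backs on the pending write */

    /* ... then write the page out as usual ... */
}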

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Tue, Sep 7, 2021 at 12:31 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Doing work in the background has some advantages, though. In
> particular, it has the possibly-large advantage of not slowing down
> foreground work.

What I really like about the idea of doing the work in the foreground
(when it's clearly not going to be too much of a latency hit) is that
it would act as a natural form of backpressure. Sometimes slowing
things down is a good thing -- fundamentally inefficient and
unsustainable use of the system's resources ought to be slowed down by
imposing the cost on the transaction that performs the relevant insert
queries. To me this seems like a problem of negative externalities.
With small transactions we can pretty much know for sure that the
benefit of not leaving pages in good shape at the moment they become
"closed" is fixed and low, while the cost to the system as a whole
over time is potentially far higher. It seems almost unbounded, in a
way, so we're bound to lose in the long run, regardless of workload.
(The work can be piggybacked across transactions to make it more
efficient, but that's mostly an implementation detail.)

Updaters might actually be much less of a problem than inserters. They
already keep costs down by pruning opportunistically -- they at least
try. Most individual logical rows that are inserted have exactly zero
chance of ever being updated, but a ~100% chance of needing to have
hint bits set sooner or later. We know very little about workload
characteristics, but these are some of the few things that we really
do know for sure.

> For me the key insight here is that HOT-pruning a heap page is a lot
> cheaper if you do it before you write the page. Once you've written
> the page, the eventual HOT-prune is going to need to dirty it, which
> will cause it to be written again. If you prune before writing it the
> first time, that cost is avoided.

Right. The difficulty with applying that insight before now has been
making the prune operation take place at the level of whole pages
systematically. I believe that it's quite valuable to set all the hint
bits on a page as we "close it out" for the first time. But setting
hint bits for all of the heap tuples on the page except one or two is
pretty much useless. You might as well not bother at all.

It follows that concurrent inserters clobbering the same page
senselessly are more or less not okay if we're going to pursue this
technique. You might still have a concurrent update of one of the
inserted rows shortly after the original inserting transaction
commits, but before the page is closed -- that might make the idea not
work out for that one page (obviously it depends on the specifics).
That seems like something we can live with, though.

> I'm not sure that it really matters
> whether the space consumed by aborted tuples gets reused by "the very
> next transaction" or, say, 10 transactions after that, or even 1000
> transactions after that. If you wait for a million transactions,
> you've quite possibly lost enough temporality to matter, but at 10 or
> 1000 that's much less likely.

I didn't mean to suggest that it had to happen in perfect lockstep.
But I do think that it should be pretty close to perfect. What I
actually do right now is prune an open page when it *appears* to be
full inside the loop in RelationGetBufferForTuple(). As crude as that
is, it sets hint bits and prunes away aborted transactions in mostly
the right way, at least as far as TPC-C is concerned. (This only works
because it fits into the wider framework of my much larger patch,
which introduces ownership semantics, open and closed pages, empty
page freelists, etc.)
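
For illustration only (the helpers are invented, and the surrounding
ownership machinery is assumed rather than shown), the shape of that
"close the page out as it fills" step is roughly:

/*
 * Sketch only: when the open page a backend owns appears full during
 * the search for space, close it out in a pristine state -- set hint
 * bits, prune away aborted tuples -- exactly once, before moving on to
 * a new open page.
 */
#include <stdbool.h>

typedef int Buffer;             /* stand-in for PostgreSQL's Buffer */

extern bool open_page_appears_full(Buffer buf);         /* placeholder */
extern void set_hints_and_prune_aborted(Buffer buf);    /* placeholder */
extern void mark_page_closed(Buffer buf);               /* placeholder */

static void
maybe_close_out_open_page(Buffer buf)
{
    if (open_page_appears_full(buf))
    {
        set_hints_and_prune_aborted(buf);
        mark_page_closed(buf);  /* from here on, leave the page alone */
    }
}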

> The exact threshold is fuzzy: every
> moment you wait makes it less likely that you have sufficient
> locality, but you can always find a workload where even a very long
> wait is acceptable, and another one where even a tiny delay is
> catastrophic, and it's hard to say what the "real world" looks like.

I agree that offloading work in the background can be very useful, and
that technique needs to be possible as future work (I actually really
want this myself). But most individual apps probably won't really
notice the latency hit from pruning if it's well managed -- as with
pruning by updates today. We can make a decision about which strategy
to use (lightweight foreground processing with a potential latency
hit, or background batch cleanup?) at a very late point in the cycle,
based on the specifics in each case. Offhand, I don't think that we
need to characterize this fuzzy threshold in too much detail. I
suspect that the main determinative question can be simple enough.
Something like: Does this transaction (or this group of transactions)
involve only trivial changes to maybe 3 - 5 heap pages at the most, or
not?

There is a separate question about what happens if our background
cleanup process cannot keep up over time. I think that we can handle
that by falling back on foreground processing (within reason). Let's
say we have transactions that each insert on to dozens or perhaps even
hundreds of heap pages each -- these are "medium sized'' write
transactions. If we're already doing way too much background
processing for similar transactions from earlier on, and if it's clear
we just can't keep up, maybe we can fall back and force foreground
processing, to keep the debt under control. (In general I think that
this back pressure, negative externalities business will be
important.)

It seems to me that this leaves one harder question unanswered: at
what point does a "medium sized" transaction become so large that it
just doesn't make sense to do either? What's the crossover point at
which background processing and foreground processing like this should
be assumed to be not worth it? That I won't speculate about just yet.
I suspect that at some point it really does make sense to leave it all
up to a true table-level batch operation, like a conventional VACUUM.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Tue, Sep 7, 2021 at 5:25 AM Hannu Krosing <hannuk@google.com> wrote:
> Are you speaking of just heap pages here, or also index pages?

Mostly heap pages, but FWIW I think it could work for index tuples
too, with retail index tuple deletion, because that allows you to
remove even would-be LP_DEAD item pointers.

> Or are you expecting these to be kept in good-enough shape by your
> earlier index manager work?

It's very workload dependent. Some things were very much improved by
bottom-up index deletion in Postgres 14, for example (non-hot updates
with lots of logically unchanged indexes). Other things weren't helped
at all, or were barely helped. I think it's important to cover all
cases.

> A minimal useful patch emerging from that understanding could be
> something which just adds hysteresis to FSM management. (TBH, I
> actually kind of expected some hysteresis to be there already, as it
> is in my mental model of "how things should be done" for managing
> almost any resource :) )

I think that you need to do the FSM before the aborted-heap-tuple
cleanup. Otherwise you don't really know when or where to apply the
special kind of pruning that the patch invents, which targets aborts
specifically.

> Adding hysteresis to FSM management can hopefully be done independently
> of all the other stuff, and also seems to be something that is
> unobtrusive and non-controversial enough to fit into the current release,
> and possibly even be back-ported.

I don't know about that! Seems kind of an invasive patch to me.

> I did not mean CHECKPOINT as a command, but more the concept of
> writing back / un-dirtying the page. In this sense it *is* special
> because it is the last point in time where you are guaranteed to have
> the page available in buffercache and thus cheap to access for
> modifications plus you will avoid a second full-page writeback because
> of cleanup. Also you do not want to postpone the cleanup to actual
> page eviction, as that is usually in the critical path for some user
> query or command.

I intend to do some kind of batching, but only at the level of small
groups of related transactions. Writing a page that was quickly opened
for new inserts, filled with newly inserted heap tuples, then closed,
and finally cleaned up doesn't seem like it needs to take any direct
responsibility for writeback.

-- 
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Hannu Krosing
Date:
On Wed, Sep 8, 2021 at 6:52 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Sep 7, 2021 at 5:25 AM Hannu Krosing <hannuk@google.com> wrote:
> > Are you speaking of just heap pages here or also index pages ?
>
> Mostly heap pages, but FWIW I think it could work for index tuples
> too, with retail index tuple deletion, because that allows you to
> remove even would-be LP_DEAD item pointers.

But it does need info from the heap page(s) the index entry points to,
which can become quite expensive if that heap page is locked by other
processes or even not cached at all.

Also, parallel processes may have moved the index entry to other pages etc.

> > Or are you expecting these to be kept in good-enoug shape by your
> > earlier index manager work ?
>
> It's very workload dependent. Some things were very much improved by
> bottom-up index deletion in Postgres 14, for example (non-hot updates
> with lots of logically unchanged indexes). Other things weren't helped
> at all, or were barely helped. I think it's important to cover all
> cases.
>
> > A minimal useful patch emerging from that understanding could be
> > something which just adds hysteresis to FSM management. (TBH, I
> > actually kind of expected some hysteresis to be there already, as it
> > is in my mental model of "how things should be done" for managing
> > almost any resource :) )
>
> I think that you need to do the FSM before the aborted-heap-tuple
> cleanup. Otherwise you don't really know when or where to apply the
> special kind of pruning that the patch invents, which targets aborts
> specifically.

An alternative is to not actually prune the page at all at this time (to
avoid re-dirtying it and/or a WAL write), but to just inspect it and add
the *potential* free space to the FSM.
Then do the pruning at the time of the next INSERT or UPDATE, when the
page is dirtied and WAL-written anyway.
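
A sketch of what that could look like (the helpers are invented; the
only point is that the FSM is told about space a later prune would
reclaim, without dirtying the page now):

/*
 * Sketch only: advertise the page's free space as if it had already
 * been pruned, and leave the actual prune to whoever dirties the page
 * next anyway.
 */
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

extern size_t current_free_space(BlockNumber blkno);            /* placeholder */
extern size_t space_reclaimable_by_prune(BlockNumber blkno);    /* placeholder */
extern void   fsm_record_free_space(BlockNumber blkno, size_t avail); /* placeholder */

static void
advertise_potential_free_space(BlockNumber blkno)
{
    fsm_record_free_space(blkno,
                          current_free_space(blkno) +
                          space_reclaimable_by_prune(blkno));
}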

> > Adding hysteresis to FSM management can hopefully be done independent
> > of all the other stuff and also seems to be something that is
> > unobtrusive and non-controversial enough to fit in current release and
> > possibly be even back-ported .
>
> I don't know about that! Seems kind of an invasive patch to me.

Why do you think that changing the calculation for when a page is added
back to the FSM is "kind of invasive"?

Not having looked at the code, I would assume from the discussion here
that we remove the page from the FSM when its usage goes above FILLFACTOR,
and we add it back when usage drops back below FILLFACTOR.

The code change would just change the "add it back" code to use
FILLFACTOR - HYSTERESIS_FACTOR instead.
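
In other words, something like the following (HYSTERESIS_FACTOR and
the percentages are invented for illustration; the real FSM tracks free
bytes per page rather than a used percentage):

/*
 * Sketch only: a page stops being advertised once it fills past
 * fillfactor, but is only advertised again once usage has dropped some
 * margin below fillfactor -- the hysteresis band.
 */
#include <stdbool.h>

#define HYSTERESIS_FACTOR 10    /* percentage points, illustrative */

static bool
should_add_back_to_fsm(int used_percent, int fillfactor)
{
    return used_percent < (fillfactor - HYSTERESIS_FACTOR);
}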

What other parts does this need to touch? Or is it just that the
calculation is not well localised in the code, and is written out
multiple times all over the place?

> > I did not mean CHECKPOINT as a command, but more the concept of
> > writing back / un-dirtying the page. In this sense it *is* special
> > because it is the last point in time where you are guaranteed to have
> > the page available in buffercache and thus cheap to access for
> > modifications plus you will avoid a second full-page writeback because
> > of cleanup. Also you do not want to postpone the cleanup to actual
> > page eviction, as that is usually in the critical path for some user
> > query or command.
>
> I intend to do some kind of batching, but only at the level of small
> groups of related transactions. Writing a page that was quickly opened
> for new inserts, filled with newly inserted heap tuples, then closed,
> and finally cleaned up doesn't seem like it needs to take any direct
> responsibility for writeback.

Agreed that this code does not need to care about writeback.

I approached it from the other end and proposed "just before writeback"
as a strategy which is easy to understand conceptually, and which is
also mostly "already there" code-wise -- that is, there is a location in
the write-back code to plug the cleanup call into.

Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Tue, Sep 7, 2021 at 9:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> What I really like about the idea of doing the work in the foreground
> (when it's clearly not going to be too much of a latency hit) is that
> it would act as a natural form of backpressure.

Right, that's fair. But, prune-before-evict can be done in the first
instance by the bgwriter and then by the foreground process doing the
eviction if the bgwriter doesn't keep up. At least, assuming we had a
bgwriter that worked, which I think right now we don't, but fixing
that might be a good idea anyway.

> I didn't mean to suggest that it had to happen in perfect lockstep.
> But I do think that it should be pretty close to perfect. What I
> actually do right now is prune an open page when it *appears* to be
> full inside the loop in RelationGetBufferForTuple().

That seems like a good idea.

> As crude as that
> is, it sets hint bits and prunes away aborted transactions in mostly
> the right way, at least as far as TPC-C is concerned. (This only works
> because it fits into the wider framework of my much larger patch,
> which introduces ownership semantics, open and closed pages, empty
> page freelists, etc.)

I don't know, I'm not really convinced that "much larger patches" that
change a lot of loosely related things all at once are good for the
project. It seems to me that there's a reasonably good chance of
replacing an annoying set of problems that existing PostgreSQL users
have worked around to some degree, knowing or unknowingly, with a
different annoying set of problems that may cause fewer or more
problems in practice. Sometimes there's no way to improve something
short of a giant project that changes a lot of things at the same
time, but a series of incremental changes is a lot less risky.

> It seems to me that this leaves one harder question unanswered: at
> what point does a "medium sized" transaction become so large that it
> just doesn't make sense to do either? What's the crossover point at
> which background processing and foreground processing like this should
> be assumed to be not worth it? That I won't speculate about just yet.
> I suspect that at some point it really does make sense to leave it all
> up to a true table-level batch operation, like a conventional VACUUM.

I doubt it makes sense to define a limit here explicitly. At some
point strategies will naturally start to fail, e.g. prune-before-evict
won't work once the operation becomes large enough that pages have to
be evicted while the transaction is still running.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Wed, Sep 8, 2021 at 8:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I didn't mean to suggest that it had to happen in perfect lockstep.
> > But I do think that it should be pretty close to perfect. What I
> > actually do right now is prune an open page when it *appears* to be
> > full inside the loop in RelationGetBufferForTuple().
>
> That seems like a good idea.

> I don't know, I'm not really convinced that "much larger patches" that
> change a lot of loosely related things all at once are good for the
> project. It seems to me that there's a reasonably good chance of
> replacing an annoying set of problems that existing PostgreSQL users
> have worked around to some degree, knowing or unknowingly, with a
> different annoying set of problems that may cause fewer or more
> problems in practice. Sometimes there's no way to improve something
> short of a giant project that changes a lot of things at the same
> time, but a series of incremental changes is a lot less risky.

But these things are *highly* related.

The RelationGetBufferForTuple() prune mechanism I described (that
targets aborted xact tuples and sets hint bits) is fundamentally built
on top of the idea of ownership of heap pages by backends/transactions
-- that was what I meant before. We *need* to have context. This isn't an
ordinary heap prune -- it doesn't have any of the prechecks to avoid
useless pruning that you see at the top of heap_page_prune_opt(). It's
possible that we won't be able to get a super-exclusive lock in the
specialized prune code path, but that's considered a rare corner case.
There is no question of concurrent inserters senselessly blocking the
prune, which is not at all true with the current approach to free
space management. So there is no way I could extract a minimal "prune
inside RelationGetBufferForTuple()" patch that would actually work.

Systems that follow ARIES closely and have UNDO *must* treat free
space as a qualitative thing, something that is meaningful only with
associated information about a deleting or inserting transaction, and
its status. There is logical UNDO for the free space management
structure, and even getting free space from a page can involve
heavyweight locking. Postgres works differently, but there is no
reason why Postgres should not do a lightweight approximate version of
the same thing - the laws of physics favor carefully grouping
logically related data, and working to keep the physical database
representation as clean a representation of the logical database as
possible, right from the start.

> > It seems to me that this leaves one harder question unanswered: at
> > what point does a "medium sized" transaction become so large that it
> > just doesn't make sense to do either? What's the crossover point at
> > which background processing and foreground processing like this should
> > be assumed to be not worth it? That I won't speculate about just yet.
> > I suspect that at some point it really does make sense to leave it all
> > up to a true table-level batch operation, like a conventional VACUUM.
>
> I doubt it makes sense to define a limit here explicitly. At some
> point strategies will naturally start to fail, e.g. prune-before-evict
> won't work once the operation becomes large enough that pages have to
> be evicted while the transaction is still running.

Perhaps. As you know I'm generally in favor of letting things fail
naturally, and then falling back on an alternative strategy.


--
Peter Geoghegan



Re: The Free Space Map: Problems and Opportunities

From
Robert Haas
Date:
On Wed, Sep 8, 2021 at 12:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> But these things are *highly* related.
>
> The RelationGetBufferForTuple() prune mechanism I described (that
> targets aborted xact tuples and sets hint bits) is fundamentally built
> on top of the idea of ownership of heap pages by backends/transactions
> -- that was what I meant before. We *need* to have context. This isn't an
> ordinary heap prune -- it doesn't have any of the prechecks to avoid
> useless pruning that you see at the top of heap_page_prune_opt(). It's
> possible that we won't be able to get a super-exclusive lock in the
> specialized prune code path, but that's considered a rare corner case.
> There is no question of concurrent inserters senselessly blocking the
> prune, which is not at all true with the current approach to free
> space management. So there is no way I could extract a minimal "prune
> inside RelationGetBufferForTuple()" patch that would actually work.

I'm not trying to argue for slimming down your patches to a size that
is so small that they no longer work.

However, I *am* arguing that, like bottom-up index deletion and the
emergency vacuum mechanism, this change sounds like something that
could *easily* have unforeseen consequences. And therefore a lot of
caution is needed. And part of that caution is not changing more
things at the same time than is really necessary. And that it's worth
thinking *hard* about how much change is *really* necessary.

It's very easy to convince oneself that everything is connected to
everything else and therefore we have to change a lot of things all at
once, but that's often not true.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: The Free Space Map: Problems and Opportunities

From
Peter Geoghegan
Date:
On Wed, Sep 8, 2021 at 9:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not trying to argue for slimming down your patches to a size that
> is so small that they no longer work.

I was just explaining how the "eager physical rollback by pruning" thing works.

> However, I *am* arguing that, like bottom-up index deletion and the
> emergency vacuum mechanism, this change sounds like something that
> could *easily* have unforeseen consequences. And therefore a lot of
> caution is needed.

I agree that a lot of caution is needed. One can justify aggressive
intervention when the likely alternative is a severe and irreversible
adverse event, like a version-driven page split. On the other hand we
really should be very conservative when that isn't what we see.
Context is very important.

> And part of that caution is not changing more
> things at the same time than is really necessary. And that it's worth
> thinking *hard* about how much change is *really* necessary.

For the record, bottom-up index deletion had very little code in the
grand scheme of things. Without documentation changes and comments it
is probably less than 1000 lines of C. And the emergency mechanism
(considered separately from the refactoring work) had very little
code. So I don't know what you mean about changing lots of things all
together at once. Huge big-bang style patches really aren't my style
at all. Don't expect that to change.

> It's very easy to convince oneself that everything is connected to
> everything else and therefore we have to change a lot of things all at
> once, but that's often not true.

There is compelling empirical evidence for my "chain of events"
explanation. You can make things much better for BenchmarkSQL on the
master branch today, either by artificially disabling the FSM
altogether (which hurts a little but helps a lot), *or* by configuring
the benchmark to not abort transactions in the first place (which
effectively simulates eager physical rollback of aborted xacts).

I don't want to change very much. Despite everything I've said, I
think it's true that the system does the right thing in most
situations -- even with BenchmarkSQL/TPC-C! I just find it useful to
focus on the exceptions, the cases where the system clearly does the
wrong thing.

-- 
Peter Geoghegan