
From: Andres Freund
Subject: Re: The Free Space Map: Problems and Opportunities
Msg-id: 20210817131751.wcq2xkevril2wxwv@alap3.anarazel.de
In response to: The Free Space Map: Problems and Opportunities (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
Hi,

On 2021-08-16 10:35:45 -0700, Peter Geoghegan wrote:
> Problems
> --------
>
> The FSM gives out space without considering the passage of time, or
> the likelihood that a particular transaction or client will consume
> its requested free space in a more or less predictable and steady way
> over time. It has no memory. The FSM therefore fails to capture
> naturally occurring locality, or at least doesn't specifically care
> about it. This is the central problem, that all other problems with
> the FSM seem to stem from in one way or another.

I suspect that one other fairly fundamental issue is that we are very
reluctant to update the FSM with information about free space. As long as
that's the case, smarter placement logic would often not have a better
place to choose from.
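
To illustrate (purely a sketch, not something that exists today): the
freespace.c API would already let a backend report free space at, say,
opportunistic pruning time. The call site and the threshold below are
assumptions made up for illustration; only the functions themselves are real.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/freespace.h"
#include "utils/rel.h"

/*
 * Hypothetical: report a page's free space to the FSM right after
 * opportunistic pruning, instead of leaving that to VACUUM.  Uses the
 * existing RecordPageWithFreeSpace(); when to call it and the 20%
 * threshold are invented for illustration.
 */
static void
report_freespace_after_prune(Relation rel, Buffer buf)
{
    Page    page = BufferGetPage(buf);
    Size    freespace = PageGetHeapFreeSpace(page);

    /* Only bother the FSM when pruning freed a substantial amount. */
    if (freespace > BLCKSZ / 5)
        RecordPageWithFreeSpace(rel, BufferGetBlockNumber(buf), freespace);
}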


> Background
> ----------
>
> To recap, the FSM tracks how much free space is in every heap page.
> Most FSM maintenance takes place during VACUUM, though ordinary
> connection backends occasionally help to keep the FSM current, via
> calls to RecordAndGetPageWithFreeSpace() made from hio.c.
>
> There is also a bulk extension mechanism added by commit 719c84c1be,
> which is used when we detect multiple concurrent inserters. This bulk
> extension mechanism does change things a little (it gives some thought
> to systematic effects), though it still has the same basic issues: it
> doesn't do nearly enough to group likely-related rows together. I
> suspect that improving the bulk extension mechanism will be an
> important part of improving the overall picture.

I think the design of the current bulk extension mechanism is quite backward
(it only takes contention into account, not other reasons to do bulk work; it
does expensive things like finding victim buffers while holding the lock; it
doesn't release the lock until all blocks have been extended; ...).  But: I
don't think a bulk extension mechanism should be tasked with doing grouping or
anything like that.


> Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
> ever end up inserting *any* new logical rows on a heap page after the
> page first fills -- it is initially "open" when first allocated, and
> then quickly becomes "closed" to inserts of new logical rows, once it
> crosses a certain almost-empty space threshold. The heap page usually
> stays "closed" forever. While the design does not *completely* ignore
> the possibility that the page won't eventually get so empty that reuse
> really does look like a good idea, it makes it rare. A heap page needs
> to have rather a lot of the original logical rows deleted before that
> becomes possible again. It's rare for it to go backwards because the
> threshold that makes a closed page become open again is much lower
> (i.e. involves much more free space) than the threshold that initially
> made a (newly allocated) heap page closed. This one-way stickiness
> quality is important.

I suspect that we'd get a *LOT* of complaints if we introduced that degree of
stickiness.


> This stickiness concept is called "hysteresis" by some DB researchers,
> often when discussing UNDO stuff [8]. Having *far less* granularity
> than FSM_CATEGORIES/255 seems essential to make that work as intended.
> Pages need to be able to settle without being disturbed by noise-level
> differences. That's all that super fine grained values buy you: more
> noise, more confusion.

Why is that directly tied to FSM_CATEGORIES? ISTM that there are ways to
benefit from a fairly granular FSM_CATEGORIES while still achieving better
locality. You could e.g. search for free space for an out-of-page update in
nearby pages with a lower free-percentage threshold, while having a different
threshold and selection criteria for the placement of inserts.
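
For example (a rough sketch only: GetPageWithFreeSpace() and
RecordAndGetPageWithFreeSpace() are the existing freespace.c entry points,
everything else, including the thresholds, is made up):

#include "postgres.h"
#include "storage/block.h"
#include "storage/freespace.h"
#include "utils/rel.h"

/* Hypothetical thresholds: updates settle for barely enough space nearby,
 * inserts demand a page with plenty of room. */
#define UPDATE_MIN_FREE(len)    (len)
#define INSERT_MIN_FREE(len)    Max((len), BLCKSZ / 8)

static BlockNumber
choose_target_block(Relation rel, BlockNumber oldBlk, Size oldBlkFree,
                    Size len, bool is_update)
{
    if (is_update && oldBlk != InvalidBlockNumber)
    {
        /*
         * Search starting near the old block with a low threshold, to keep
         * row versions physically close together (the existing function
         * already prefers blocks covered by the same FSM page).
         */
        return RecordAndGetPageWithFreeSpace(rel, oldBlk, oldBlkFree,
                                             UPDATE_MIN_FREE(len));
    }

    /* Inserts use a stricter threshold, biasing them toward emptier pages. */
    return GetPageWithFreeSpace(rel, INSERT_MIN_FREE(len));
}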


> Visibility map
> --------------
>
> If the logical database and natural locality are important to the FSM,
> then what about the visibility map? And what about the relationship
> between the FSM and the visibility map, in light of all this?
>
> Currently VACUUM doesn't care about how its FSM behavior interacts
> with how it sets all-frozen/all-visible bits for the same heap page.
> To me this seems completely unreasonable -- they're *obviously*
> related! We're probably only gaining a minimal amount of free space on
> one occasion by ignoring the VM/FSM relationship, for which we pay a
> high cost. Worst of all we're *perpetuating the cycle* of dirtying and
> redirtying the same pages over time. Maybe we should go as far as
> merging the FSM and VM, even -- that seems like a natural consequence
> of logical-ish/qualitative definitions of "page is full".

The density of the VM is fairly crucial for efficient index-only scans, so I'm
doubtful that merging the FSM and VM is a good direction to go in.
Additionally, the VM needs to be accurately durable, which isn't the case for
the FSM (a property I think we should exploit to keep the FSM more up to
date).

But perhaps I'm just missing something here. What is the strong interlink
between the FSM and all-frozen/all-visible that you see on the VACUUM side?
ISTM that the main linkage is on the side of inserts/updates that choose a
target page from the FSM. Isn't it better to make that side smarter?



> Imagine if the FSM tracked recent free space requests that have been
> satisfied already. Requesting backends will maintain this same
> metadata when they burn through their bulk allocation of reserved
> space/heap pages. This new structure would contain metadata like the
> requesting XID/backend for a given lease of heap pages, the number of
> leases during the last bulk extension operation, and maybe the number
> of concurrently waiting backends at that time. Now we have some sense
> of recent history, and start to recognize trends over time. The FSM is
> now able to *take back* some of the space that it gives out, based on
> new information. Now the FSM can afford to make mistakes because the
> mistakes can always be corrected a little later. The FSM can be
> generous based on speculation, heuristics, whatever it might be. It
> should be possible for the FSM to have it both ways, more or less.

Hm. To me this seems like a somewhat separate layer from the FSM. Having that
kind of data go through shared buffers seems like something you wouldn't
actually want: you'd need to worry about on-disk consistency/cleanup etc., you
couldn't take granular locks at a sub-page level, and every access would have
to go through buffer mapping / pinning.

To me this seems like it'd be better addressed by a shared, per-relfilenode,
in-memory data structure. Thomas Munro has been working on keeping accurate
per-relfilenode relation size information. ISTM that that'd be a better place
to hook in for this.
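
Something along these lines, say (entirely hypothetical; none of these types
or fields exist, it's just meant to show the state living in shared memory
keyed by relfilenode rather than in shared buffers):

#include "postgres.h"
#include "storage/block.h"
#include "storage/lwlock.h"
#include "storage/relfilenode.h"

#define MAX_RECENT_LEASES 8

/* One recent grant of blocks to a particular transaction. */
typedef struct SpaceLease
{
    TransactionId   xid;            /* transaction the lease was handed to */
    BlockNumber     first_block;    /* start of the leased block range */
    uint16          nblocks;        /* blocks in the lease */
    uint16          nblocks_used;   /* blocks the backend has filled so far */
} SpaceLease;

/* Shared, per-relfilenode state; would live in a shmem hash table, not in
 * shared buffers, so it needs no WAL and no buffer pins. */
typedef struct RelSpaceState
{
    RelFileNode     rnode;                  /* which relation this is for */
    LWLock          lock;                   /* protects the fields below */
    BlockNumber     last_extended_block;    /* end of last bulk extension */
    int             nleases;
    SpaceLease      leases[MAX_RECENT_LEASES];  /* recent, reclaimable grants */
} RelSpaceState;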


> would be quite natural for the leader backend to say to each individual
> follower backend: thank you for waiting for me, here is a big contiguous
> range of heap pages (e.g., as many as 50 heap pages or more per xact when
> things truly ramp up), which should be enough to allow you to avoid going to
> the FSM again for quite a long time.

That'd end up badly if we did it for a relation that's hotly inserted into by
a lot of connections, but where each connection only inserts a few rows. I
suspect we'd need information from the inserting backends about the number of
pages they're likely to need (e.g. a smarter "extension queue" where each
entry lists the number of pages requested) and/or stats about the number of
consecutive insertions a relation typically receives.
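
Roughly this shape, for illustration (nothing like it exists; the point is
just that each queue entry carries the waiter's expected page count):

#include "postgres.h"

/* One waiter in a hypothetical smarter extension queue. */
typedef struct ExtensionWaiter
{
    int         procno;         /* waiting backend's PGPROC index */
    uint32      npages_wanted;  /* pages this backend expects to consume:
                                 * 1 for a single-row INSERT, more for a
                                 * COPY that has been ramping up */
} ExtensionWaiter;

/* The queue itself: the extending backend can size one bulk extension to
 * cover everybody currently waiting. */
typedef struct ExtensionQueue
{
    uint32          nwaiters;
    uint32          total_pages_wanted; /* sum over all waiters */
    ExtensionWaiter waiters[FLEXIBLE_ARRAY_MEMBER];
} ExtensionQueue;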


> But the leader backend doesn't do that at all -- it just puts all of
> the pages (up to 512 total) in the FSM. All of the waiting follower
> backends are left to senselessly fight it out for free space by simply
> going back to the FSM when they wake up again. Why can't they
> cooperate intelligently? What stops that today?

The lack of suitable in-memory coordination facilities.

It turns out that I looked a bunch at this area because of AIO + DIO: with DIO
the typical Linux filesystems (ext4, xfs) do not perform any delayed
allocation, which means that non-bulk extension leads to *horrible* filesystem
fragmentation. And even the bulk paths don't allocate enough blocks at once to
avoid fragmentation under common amounts of concurrency.  But just increasing
the degree of bulk extension isn't helpful - it just exacerbates the already
bad thundering herd issues: all backends decide to try to extend at the same
time, all but one of them wait, the one backend extends and then wakes the
other N-1 up, and they all contend for the same locks, etc.

I addressed (or rather sidestepped) this to a small degree in the AIO+DIO
patchset, by making extension faster and making a few improvements around
locking. But it's not good enough.

If the relation extension infrastructure instead had information about the
number of blocks each waiter wants, and we had Thomas' "accurate shared
relation size" infrastructure, we could do a lot better (a rough sketch
follows the two points below):

1) Whenever a backend needs new space and most of the previously bulk-extended
   space has been used, opportunistically try to bulk-extend. In the happy
   path this avoids the thundering herd issue, because other backends can
   continue inserting while the extension is in progress.

2) Whenever bulk extending, release pages to individual backends by looking at
   the queue of backends waiting for extension, waking each backend up
   individually as soon as we've extended far enough to cover its request.

   This provides two main benefits: First, it avoids all backends trying to
   acquire exactly the same resources just after waking up, without being able
   to benefit (because the first backend will often just fill the
   page). Second, it substantially improves average latency during extension,
   because individual backends can continue as soon as the bulk extension has
   progressed far enough for them, rather than having to wait for the entire
   extension.
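
Here's a rough sketch of 2), assuming the kind of per-waiter queue sketched
earlier; relation_extend_by() and assign_block_range() are invented helpers,
as is everything else except SetLatch() and PGPROC:

#include "postgres.h"
#include "storage/block.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "utils/rel.h"

typedef struct ExtendWaiter
{
    PGPROC     *proc;           /* backend waiting for space */
    uint32      npages_wanted;  /* pages it asked for */
} ExtendWaiter;

/* Hypothetical helpers, declared only so the sketch is self-contained. */
extern BlockNumber relation_extend_by(Relation rel, uint32 npages);
extern void assign_block_range(PGPROC *proc, BlockNumber start, uint32 npages);

static void
bulk_extend_and_distribute(Relation rel, ExtendWaiter *waiters, int nwaiters)
{
    for (int i = 0; i < nwaiters; i++)
    {
        /*
         * Extend by enough blocks for this waiter (a real implementation
         * would presumably extend in larger batches than one waiter at a
         * time).
         */
        BlockNumber start = relation_extend_by(rel, waiters[i].npages_wanted);

        /*
         * Record the block range this backend now owns and wake it
         * immediately: it can start inserting while we keep extending for
         * the remaining waiters.
         */
        assign_block_range(waiters[i].proc, start, waiters[i].npages_wanted);
        SetLatch(&waiters[i].proc->procLatch);
    }
}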



> FSM "rollback"
> -------------
>
> (This is where my argument comes together, finally.)
>
> FSM rollback in the FSIP paper is really just about reliable ownership
> semantics. Ownership of free space in heap pages that is scoped at the
> level of transactions.
>
> It seems to me that ownership of free space in heap pages by
> particular transactions is really all that the bulk extension FSM
> logic is missing -- and getting that right seems like a big part of
> fixing the FSM comprehensively. The bulk extension mechanism cannot
> *trust* the backends to use more than a little space at a time today.
> While on average each backend's use of free space probably *is*
> predictable over time, it only takes one or two straggler backends per
> bulk operation to mess everything up -- that's enough for the system
> to end up with very poor space utilization. Since we don't have any
> sense of ownership of space in heap pages by transactions today, the
> code must hedge against stragglers/space leaks by making the space
> *generally* available to all. Of course, this hedging comes at a high
> cost: it prevents the whole system (heapam, FSM, and backends) from
> capturing naturally occurring locality in a variety of important
> cases. Including the important TPC-C case I mentioned.

I think the more fundamental issue with bulk extension is that it shouldn't
actually get its pages from the FSM at all. That's just an ugly implementation
detail of the hack we currently have, rather than something that should live
on.  We may want to keep inserting bulk-extended pages into the FSM to handle
the edge case of backends erroring out before the space is used, but the happy
path should not touch the FSM at all.


> Explicit scoped ownership of free space in heap pages removes the need
> to hedge. It addresses the problem of would-be space leaks directly,
> while *also* fixing the FSM locality problems implicitly. Metadata
> about recent requests gives us the *context* needed to do all this
> well.

It sounds like you're thinking of using some global history, or at least a
backend-local history? I think it might be better to have backend-local,
per-command state. E.g. for a COPY, aggressively increase the amount of space
reserved, but not for a single-row insert.
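
A minimal sketch of what I mean (hypothetical, backend-local state; the
ramp-up policy and the cap are arbitrary):

#include "postgres.h"

/* Backend-local, per-command reservation state; reset at command start
 * with next_request = 1. */
typedef struct CommandSpaceState
{
    uint32      pages_reserved; /* pages handed to us, not yet filled */
    uint32      pages_used;     /* pages we have filled this command */
    uint32      next_request;   /* how many pages to ask for next time */
} CommandSpaceState;

/* Called (hypothetically) when the current command runs out of reserved
 * space: ramp up geometrically while the command keeps consuming what it
 * was given, so a COPY quickly reserves big chunks while a single-row
 * INSERT never asks for more than one page. */
static uint32
next_reservation_size(CommandSpaceState *state)
{
    if (state->pages_used >= state->pages_reserved)
        state->next_request = Min(state->next_request * 2, 64);
    else
        state->next_request = 1;
    return state->next_request;
}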



> I now wonder if the FSM is fundamentally doing the wrong thing by
> keeping track of all "free space" in every page over time. Why
> wouldn't we just have a free space management strategy that is
> generally only concerned with recent events?

I guess it depends on what you define as "recent", but I don't see a way to
bound the timeframe usefully. We need to deal with cases where table sizes go
up and down over time, e.g. because there's periodic batch activity on top of
ongoing activity. What's the time bound for unused space somewhere in a
relation? I don't see one, unless we add compaction logic to vacuum.

Greetings,

Andres Freund


