Re: The Free Space Map: Problems and Opportunities - Mailing list pgsql-hackers

From Robert Haas
Subject Re: The Free Space Map: Problems and Opportunities
Date
Msg-id CA+TgmoaHWLXkwiXqyAEXPbfRuOKrHtMtsV1Sr-77nTzhHJkLXg@mail.gmail.com
Whole thread Raw
In response to Re: The Free Space Map: Problems and Opportunities  (Andres Freund <andres@anarazel.de>)
Responses Re: The Free Space Map: Problems and Opportunities
List pgsql-hackers
On Tue, Aug 17, 2021 at 9:18 AM Andres Freund <andres@anarazel.de> wrote:
> > Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
> > ever end up inserting *any* new logical rows on a heap page after the
> > page first fills -- it is initially "open" when first allocated, and
> > then quickly becomes "closed" to inserts of new logical rows, once it
> > crosses a certain almost-empty space threshold. The heap page usually
> > stays "closed" forever. While the design does not *completely* ignore
> > the possibility that the page won't eventually get so empty that reuse
> > really does look like a good idea, it makes it rare. A heap page needs
> > to have rather a lot of the original logical rows deleted before that
> > becomes possible again. It's rare for it to go backwards because the
> > threshold that makes a closed page become open again is much lower
> > (i.e. involves much more free space) than the threshold that initially
> > made a (newly allocated) heap page closed. This one-way stickiness
> > quality is important.
>
> I suspect that we'd get a *LOT* of complaints if we introduced that degree of
> stickiness.

I don't know what the right degree of stickiness is, but I can easily
believe that we want to have more than none. Part of Peter's point, at
least as I understand it, is that if a page has 100 tuples and you
delete 1 or 2 or 3 of them, it is not smart to fill up the space thus
created with 1 or 2 or 3 new tuples. You should instead hope that
space will optimize future HOT updates, or else wait for the page to
lose some larger number of tuples and then fill up all the space at
the same time with tuples that are, hopefully, related to each other
in some useful way. Now what's the threshold? 20 out of 100? 50? 80?
That's not really clear to me. But it's probably not 1 out of 100.

> > This stickiness concept is called "hysteresis" by some DB researchers,
> > often when discussing UNDO stuff [8]. Having *far less* granularity
> > than FSM_CATEGORIES/255 seems essential to make that work as intended.
> > Pages need to be able to settle without being disturbed by noise-level
> > differences. That's all that super fine grained values buy you: more
> > noise, more confusion.
>
> Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
> from a fairly granular FSM_CATEGORIES, while still achieving better
> locality. You could e.g. search for free space for an out-of-page update in
> nearby pages with a lower free-percentage threshold, while having a different
> threshold and selection criteria for placement of inserts.

Yeah, I don't think that reducing FSM_CATEGORIES is likely to have
much value for its own sake. But it might be useful as a way of
accomplishing some other goal. For example if we decided that we need
a bit per page to track "open" vs. "closed" status or something of
that sort, I don't think we'd lose much by having fewer categories.

> The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
> merging FSM and VM is a good direction to go to. Additionally we need to make
> sure that the VM is accurately durable, which isn't the case for the FSM
> (which I think we should use to maintain it more accurately).
>
> But perhaps I'm just missing something here. What is the strong interlink
> between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM
> that the main linkage is on the side of inserts/update that choose a target
> page from the FSM. Isn't it better to make that side smarter?

I don't have a well-formed opinion on this point yet. I am also
skeptical of the idea of merging the FSM and VM, because it seems to
me that which pages are all-visible and which pages are full are two
different things. However, there's something to Peter's point too. An
empty page is all-visible but still valid as an insertion target, but
this is not as much of a contradiction to the idea of merging them as
it might seem, because marking the empty page as all-visible is not
nearly as useful as marking a full page all-visible. An index-only
scan can't care about the all-visible status of a page that doesn't
contain any tuples, and we're also likely to make the page non-empty
in the near future, in which case the work we did to set the
all-visible bit and log the change was wasted. The only thing we're
accomplishing by making the page all-visible is letting VACUUM skip
it, which will work out to a win if it stays empty for multiple VACUUM
cycles, but not otherwise.

Conceptually, you might think of the merged data structure as a
"quiescence map." Pages that aren't being actively updated are
probably all-visible and are probably not great insertion targets.
Those that are being actively updated are probably not all-visible and
have better chances of being good insertion targets. However, there
are a lot of gremlins buried in the word "probably." A page could get
full but still not be all-visible for a long time, due to long-running
transactions, and it doesn't seem likely to me that we can just ignore
that possibility. Nor on the other hand does the difference between an
old-but-mostly-full page and an old-but-mostly-empty page seem like
something we just want to ignore. So I don't know how a merged data
structure would actually end up being an improvement, exactly.

> To me this seems like it'd be better addressed by a shared, per-relfilenode,
> in-memory datastructure. Thomas Munro has been working on keeping accurate
> per-relfilenode relation size information. ISTM that that that'd be a better
> place to hook in for this.

+1. I had this same thought reading Peter's email. I'm not sure if it
makes sense to hook that into Thomas's work, but I think it makes a
ton of sense to use shared memory to coordinate transient state like
"hey, right now I'm inserting into this block, you guys leave it
alone" while using the disk for durable state like "here's how much
space this block has left."

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
Next
From: Robert Haas
Date:
Subject: Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)