
From Peter Geoghegan
Subject Re: The Free Space Map: Problems and Opportunities
Date
Msg-id CAH2-Wzm+=o6A-dkDpXzb24ayau1XXQVeDO5V_aVbYsafu4Wh8w@mail.gmail.com
In response to Re: The Free Space Map: Problems and Opportunities  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: The Free Space Map: Problems and Opportunities  (Robert Haas <robertmhaas@gmail.com>)
Re: The Free Space Map: Problems and Opportunities  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Tue, Aug 17, 2021 at 9:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know what the right degree of stickiness is, but I can easily
> believe that we want to have more than none. Part of Peter's point, at
> least as I understand it, is that if a page has 100 tuples and you
> delete 1 or 2 or 3 of them, it is not smart to fill up the space thus
> created with 1 or 2 or 3 new tuples. You should instead hope that
> space will optimize future HOT updates, or else wait for the page to
> lose some larger number of tuples and then fill up all the space at
> the same time with tuples that are, hopefully, related to each other
> in some useful way.

That is a big part of it. But it gets worse than that. It is truly
insidious, once you follow the chain of problems over time -- maybe
over hours and hours. BenchmarkSQL makes this a lot easier to see.

Right now when we set heap fill factor to 85 (which is what
BenchmarkSQL does), we effectively have this chunk of special reserved
free space left behind on each page. The remaining 15% of tuple space
is of course reserved for successor versions of the same logical rows
-- those rows from the time that the page first fills (presumably they
all come from straight inserts). That's our starting point -- simple
enough.
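
To make the arithmetic concrete, here is a minimal standalone C sketch of what a fill factor of 85 works out to per page. This is not the actual PostgreSQL code -- the 8192 byte block size is just the default assumption, and page header and line pointer overhead are ignored:

#include <stdio.h>

#define BLCKSZ 8192             /* assuming the default block size */

/* Bytes that insertions should leave free on each page, per fillfactor. */
static int
target_page_free_space(int fillfactor)
{
    return BLCKSZ * (100 - fillfactor) / 100;
}

int
main(void)
{
    int fillfactor = 85;        /* what BenchmarkSQL sets */
    int reserved = target_page_free_space(fillfactor);

    /* Roughly 1228 bytes per page held back for successor row versions. */
    printf("fillfactor=%d reserves %d of %d bytes per page\n",
           fillfactor, reserved, BLCKSZ);
    return 0;
}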

The approach we take effectively gives up on the *entire page* at
literally the first sign of trouble -- the first time we fail to
hold a successor version on that same page. Even though it might well
have been a once-off perturbation. What we might call bad luck, as
opposed to an inevitability. For example, maybe ANALYZE ran that
second, which made the crucial difference for that one heap page that
one time. That alone is enough to make the implementation assume all
bets are off for the entire page -- it's as if we pessimistically
assume that there is no chance of keeping anything like the original
rows on the heap page, so we might as well let it all get mixed up
even more. This is way too pessimistic -- it's black and white
thinking.

While in general that level of pessimism might turn out to be
justified (anything is possible), we should hold the line and find out
for sure how bad it is. I happen to know that our problematic
BenchmarkSQL/TPC-C tables just don't have any skew, so this pessimism
is particularly wrong-headed there.

Bear in mind that the new unrelated tuples that ultimately replace our
original couldn't-fit tuples (probably after the next VACUUM) are
probably not "average tuples". Maybe they're newly inserted tuples,
which is bad in the BenchmarkSQL case, because they're all
*guaranteed* to be updated in the future (owing to the details of the
workload). With some other typical workload they might often be
successor versions from updates instead -- most likely for rows that
are disproportionately likely to be updated many times. It's truly
pernicious.

If we only accepted the original row migration, and didn't throw good
money after bad, then the cycle would be broken. To me this is a bit
like certain heuristic algorithms, where you technically accept a
suboptimal outcome to avoid getting stuck in a local maximum.

> Now what's the threshold? 20 out of 100? 50? 80?

I'm not going to pretend to know the answer. But I will point out that
one DB system whose heap fill factor defaults to 90 seems to have a
symmetric setting for the "open up page again" point -- the default
for that is 40. Not sure that that really matters to us, but that does
seem pretty low to me. It's very sticky indeed.
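
Just to illustrate the shape of that kind of policy, here is a hypothetical sketch of a sticky open/closed rule with two symmetric thresholds. The 90/40 numbers are borrowed from that other system's defaults mentioned above; every name here is made up for illustration and is not anything that exists in PostgreSQL:

#include <stdbool.h>

typedef struct
{
    int  used_pct;   /* percentage of the page currently holding tuples */
    bool open;       /* is the page advertised as an insertion target? */
} PageState;

#define CLOSE_AT_USED_PCT  90   /* fillfactor-style threshold: stop new inserts */
#define REOPEN_AT_USED_PCT 40   /* page must drain this far before reopening */

static void
update_page_state(PageState *page)
{
    if (page->open && page->used_pct >= CLOSE_AT_USED_PCT)
        page->open = false;   /* closed: remaining space is for HOT updates */
    else if (!page->open && page->used_pct < REOPEN_AT_USED_PCT)
        page->open = true;    /* enough space freed up to open wholesale */

    /*
     * Anywhere between the two thresholds the state is sticky: losing a
     * tuple or two from a closed page does not make it an insert target.
     */
}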

> Yeah, I don't think that reducing FSM_CATEGORIES is likely to have
> much value for its own sake. But it might be useful as a way of
> accomplishing some other goal. For example if we decided that we need
> a bit per page to track "open" vs. "closed" status or something of
> that sort, I don't think we'd lose much by having fewer categories.

This is probably one of my less important points. I addressed it in
my earlier remarks to Andres.

> I don't have a well-formed opinion on this point yet. I am also
> skeptical of the idea of merging the FSM and VM, because it seems to
> me that which pages are all-visible and which pages are full are two
> different things.

As I clarified in my email to Andres, this was just a bad choice of
words. I meant that they're at least conceptually much closer,
provided you're willing to accept my general view of things.

> However, there's something to Peter's point too. An
> empty page is all-visible but still valid as an insertion target, but
> this is not as much of a contradiction to the idea of merging them as
> it might seem, because marking the empty page as all-visible is not
> nearly as useful as marking a full page all-visible. An index-only
> scan can't care about the all-visible status of a page that doesn't
> contain any tuples, and we're also likely to make the page non-empty
> in the near future, in which case the work we did to set the
> all-visible bit and log the change was wasted. The only thing we're
> accomplishing by making the page all-visible is letting VACUUM skip
> it, which will work out to a win if it stays empty for multiple VACUUM
> cycles, but not otherwise.

That's a big part of it. The other thing is systemic effects, which
relates to the part of this email above about the chain of events.
I'll talk about that some more now.

Let's say that VACUUM puts a non-zero usable amount of free space in
the FSM for a page that it marks all-visible/all-frozen at the same
time -- this is possible today, of course. To me this seems
contradictory, at least when there isn't much space -- which I believe
is common. It's at least a minor waste to set the VM bit.

Now let's say we change VACUUM to not waste effort on setting such a
heap page all-visible -- we're now avoiding that minor waste, which is
an okay goal. But
we're also doing something else, that I find very objectionable: we're
effectively betting that the situation will improve for the same page
at some point in the future, during a future VACUUM. It is possible
that we could get just one or two inserts on the same page, and then
see it again, and then we're done. If that happens, we win our bet.
Otherwise, we lose our bet.

Here's the problem with making that bet: it's typically an awful bet.
The potential upside is clearly pretty low. And more importantly,
there is a decent chance that we'll actually get some row that is
going to mess everything up -- not an "average row" (whatever that may
mean). Again, very frequently updated rows are often what'll end up on
the page -- enough to ensure that the page never gets its all-visible
bit set.

Here is why I find this *extremely* objectionable: we *actually* risk
experiencing a downside for the heap page by taking this awful bet
*many, many times* in the future -- it's not just a once-off downside,
I believe. This is not in fact a choice about one heap page on one
occasion -- it is actually a *policy* affecting the page again and
again (and every other heap page too). It is only by making a
not-quite-physically-full page "closed" in the way that I advocate
that we avoid taking this awful bet -- that is how it's possible to
artificially *make* this a question about one page on one occasion.
We're bounding the
downside, lessening our exposure to the worst case. Intuitively, I
think that the possible downside is essentially unlimited, if we
assume systemic destabilization risks (which seems wise). Whereas the
possible upside of taking that awful bet is clearly, measurably small.
A once-off upside. Peanuts, really.

(Of course my argument won't work if you assume that even with the
"closed" page choice we naturally get more updates to one of the aged
frozen rows anyway. But we're counting on that not happening in either
scenario/with *either* strategy, so I don't think that it invalidates
anything I've said.)
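
To make the alternative concrete, here is a hypothetical sketch of the VACUUM-side decision I'm advocating: when a page is about to be marked all-visible and only a small remnant of free space is left, treat the page as "closed" and don't advertise that remnant in the FSM. The names and the threshold are invented for illustration only, not real PostgreSQL code:

#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ               8192
#define CLOSED_PAGE_MIN_FREE (BLCKSZ / 4)   /* invented threshold, ~2KB */

typedef struct
{
    bool   set_all_visible;      /* set the VM bit for this page now */
    size_t fsm_free_to_record;   /* free space to advertise; 0 = "closed" */
} VacuumPageDecision;

static VacuumPageDecision
decide_page(bool all_visible_eligible, size_t free_bytes)
{
    VacuumPageDecision d;

    d.set_all_visible = all_visible_eligible;

    if (all_visible_eligible && free_bytes < CLOSED_PAGE_MIN_FREE)
        d.fsm_free_to_record = 0;    /* closed: don't invite stray inserts */
    else
        d.fsm_free_to_record = free_bytes;

    return d;
}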

> Conceptually, you might think of the merged data structure as a
> "quiescence map." Pages that aren't being actively updated are
> probably all-visible and are probably not great insertion targets.
> Those that are being actively updated are probably not all-visible and
> have better chances of being good insertion targets. However, there
> are a lot of gremlins buried in the word "probably." A page could get
> full but still not be all-visible for a long time, due to long-running
> transactions, and it doesn't seem likely to me that we can just ignore
> that possibility.

I think that we should opportunistically be setting the all-frozen bit
in heap pages (and freezing tuples earlier than required) anyway. For
example, we could notice that a whole heap page is eligible to be
marked all-frozen, and freeze the tuples early -- we could restart
processing of the page once we realize that freezing early is a good
idea.
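
Roughly speaking, the check could look like this hypothetical sketch: a first pass over the page to see whether every tuple is eligible, then a second pass that freezes everything and sets the page-level state once. The struct and names are stand-ins, not the real VACUUM code:

#include <stdbool.h>

typedef struct
{
    bool eligible_for_freeze;   /* xmin old enough, no live xmax, and so on */
    bool frozen;
} TupleStub;                    /* stand-in for a real heap tuple */

static bool page_all_frozen = false;

static void
maybe_freeze_page_early(TupleStub *tuples, int ntuples)
{
    /* First pass: is every tuple on the page eligible for freezing? */
    for (int i = 0; i < ntuples; i++)
    {
        if (!tuples[i].eligible_for_freeze)
            return;             /* no -- keep freezing lazily, as today */
    }

    /* "Restart": freeze everything early, then set all-frozen just once. */
    for (int i = 0; i < ntuples; i++)
        tuples[i].frozen = true;
    page_all_frozen = true;
}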

Why even have a distinct all-visible VM bit? The per-tuple overhead of
the WAL records for freezing is kind of bulky, but that could be
fixed. It's not a good reason. (Of course I'm not arguing that we
should break pg_upgrade, just that there is no reason not to freeze
early when it's cheap to do so in passing.)

> +1. I had this same thought reading Peter's email. I'm not sure if it
> makes sense to hook that into Thomas's work, but I think it makes a
> ton of sense to use shared memory to coordinate transient state like
> "hey, right now I'm inserting into this block, you guys leave it
> alone" while using the disk for durable state like "here's how much
> space this block has left."

I agree that that's the best starting point for this.
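
For illustration, here is a hypothetical sketch of the transient-state half only: a tiny table of "this block is currently claimed as an insertion target" entries. A real version would live in PostgreSQL shared memory with proper locking and sizing; plain C11 atomics stand in for that here, and all of the names are made up:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CLAIMS    64
#define INVALID_BLOCK UINT32_MAX

/* One slot per claimed block; INVALID_BLOCK means the slot is free. */
static _Atomic uint32_t claimed_blocks[MAX_CLAIMS];

/* Call once at startup: mark every slot as free. */
static void
init_claims(void)
{
    for (int i = 0; i < MAX_CLAIMS; i++)
        atomic_store(&claimed_blocks[i], INVALID_BLOCK);
}

/* Try to claim blkno as this backend's private insertion target. */
static bool
claim_block(uint32_t blkno)
{
    for (int i = 0; i < MAX_CLAIMS; i++)
    {
        uint32_t expected = INVALID_BLOCK;

        if (atomic_compare_exchange_strong(&claimed_blocks[i], &expected, blkno))
            return true;        /* got it; other backends should skip blkno */
    }
    return false;               /* table full; fall back to a normal FSM search */
}

/* Release the claim once done inserting (or once the page fills up). */
static void
release_block(uint32_t blkno)
{
    for (int i = 0; i < MAX_CLAIMS; i++)
    {
        uint32_t expected = blkno;

        atomic_compare_exchange_strong(&claimed_blocks[i], &expected, INVALID_BLOCK);
    }
}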

--
Peter Geoghegan


