Re: Why mark empty pages all visible? - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: Why mark empty pages all visible?
Msg-id: CAH2-Wzm+urHRojLwiwgKRiezNoe5e5x3AK-RSuHXJr+2C5KT1Q@mail.gmail.com
In response to: Re: Why mark empty pages all visible? (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Mar 28, 2023 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
> Why is that? It's as clear a signal of concurrent activity on the buffer
> you're going to get.

Not immediately getting a cleanup lock in VACUUM is informative only
to the extent that you care about what is happening in that very
nanosecond. If you look in detail at which pages it happens to, what
you end up with is a whole bunch of noise, which (on its own) tells
you exactly nothing about what VACUUM really ought to be doing with
those pages. In almost all cases we could get a cleanup lock by
waiting one millisecond and retrying.
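
To illustrate what I mean, a minimal sketch (the retry policy is my
assumption, not anything vacuumlazy.c does today;
ConditionalLockBufferForCleanup() and pg_usleep() are the real
primitives, and "buf" is assumed to be a pinned buffer):

    /*
     * Instead of treating a failed cleanup lock attempt as a signal
     * to skip the page, wait briefly and try again -- the conflicting
     * pin is almost always gone within a millisecond.
     */
    while (!ConditionalLockBufferForCleanup(buf))
        pg_usleep(1000L);   /* 1ms, then retry */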

I suspect that the cleanup lock thing might be a noisy, unreliable
proxy for the condition that you actually care about, in the context
of your work on relation extension. I bet there is a better signal to
go on, if you look for one.

> It's interesting to understand *why* we are doing what we are. I think it'd
> make sense to propose changing how things work around this, but I just don't
> feel like I have a good enough grasp for why we do some of the things we
> do. And given there's not a lot of comments around it and some of the comments
> that do exist are inconsistent with themselves, looking at the history seems
> like the next best thing?

I think that I know why Heikki had no comments about PageIsEmpty()
pages when the VM first went in, back in 2009: because it was just so
obvious that you'd treat them the same as any other initialized page,
it didn't seem to warrant a comment at all. The difference between a
page with 0 tuples and a page with 1 tuple is the same as the
difference between a page with 1 tuple and a page with 2 tuples: a
tiny difference (one extra tuple), of no particular consequence.
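
For reference, the two page states under discussion, paraphrased from
src/include/storage/bufpage.h (an "empty" page is initialized but has
no line pointers; a "new" page was never initialized at all):

    #define PageIsEmpty(page) \
        (((PageHeader) (page))->pd_lower <= SizeOfPageHeaderData)
    #define PageIsNew(page) \
        (((PageHeader) (page))->pd_upper == 0)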

I think that you don't see it that way now because you're focussed on
the hio.c view of things. That code looked very different back in
2009, and in any case is very far removed from vacuumlazy.c.

I can tell you what I was thinking of with lazy_scan_new_or_empty: I
hate special cases. I will admit to being a zealot about it.

> > I gather that you *don't* want to do anything about the
> > lazy_vacuum_heap_page behavior with setting empty pages all-visible (just
> > the lazy_scan_new_or_empty behavior).
>
> Not in the short-medium term, at least. In the long term I do suspect we might
> want to do something about it.  We have a *crapton* of contention in the FSM
> and caused by the FSM in bulk workloads. With my relation extension patch
> disabling the FSM nearly doubles concurrent load speed.

I've seen the same effect myself. There is no question that that's a
big problem.

I think that the problem is that the system doesn't have any firm
understanding of pages as things that are owned by particular
transactions and/or backends, at least to a limited, scoped extent.
It's all really low level, when it actually needs to be high level and
take lots of context that comes from the application into account.

> At the same time, the fact that we might lose knowledge about all the
> existing free space in case of promotion or crash and never rediscover that
> space (because the pages are frozen), seems decidedly not great.

Unquestionably.

> I don't know what the path forward is, but it seems somewhat clear that we
> ought to do something. I suspect having a not crash-safe FSM isn't really
> acceptable anymore - it probably is fine to not persist a *reduction* in free
> space, but we can't permanently lose track of free space, no matter how many
> crashes.

Strongly agreed. It's a terrible false economy. If we bit the bullet
and made relation extension and the FSM crash safe, we'd gain so much
more than we'd lose.

> One thing I am fairly certain about is that using the FSM to tell other
> backends about newly bulk extended pages is not a good solution, even though
> we're stuck with it for the moment.

Strongly agreed.

> > I think that you must be arguing for making the early lifetime of a
> > heap page special to VACUUM, since AFAICT you want to change VACUUM's
> > behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not*
> > with pages that have one remaining LP_UNUSED item, but are otherwise
> > empty (which could be set all-visible/all-frozen in either the first
> > or second heap pass, even if we disabled the lazy_scan_new_or_empty()
> > behavior you're complaining about). You seem to want to distinguish
> > between very new pages (that also happen to be empty), and old pages
> > that happen to be empty. Right?
>
> I think that might be worthwhile, yes. The retry code in
> RelationGetBufferForTuple() is quite hairy and almost impossible to test. If
> we can avoid the complexity, at a fairly bounded cost (vacuum needing to
> re-visit new/empty pages if they're currently pinned), it'd imo be more than
> worth the price.

Short term, you could explicitly say that PageIsEmpty() means that the
page is qualitatively different to other empty pages that were left
behind by VACUUM's second phase.
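
A minimal sketch of what that might look like from VACUUM's side --
the PageIsNew()/PageIsEmpty() tests are real, but the skip-the-page
policy here is purely hypothetical:

    Page    page = BufferGetPage(buf);

    if (PageIsNew(page) || PageIsEmpty(page))
    {
        /*
         * Hypothetical: assume the page was recently extended, and
         * leave it to the extending backend instead of setting it
         * all-visible here.
         */
        UnlockReleaseBuffer(buf);
        continue;       /* next block in the scan loop */
    }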

> But perhaps the better path forward is to just bite the bullet and introduce a
> shared memory table of open files, that contains "content size" and "physical
> size" for each relation.  We've had a lot of things over the years that'd have
> benefitted from that.

Strongly agreed on this.
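
To make the shape of that concrete, a rough sketch of one entry in
such a table -- every name here is invented for illustration, nothing
like it exists in core today:

    typedef struct RelSizeEntry
    {
        RelFileLocator  rlocator;       /* which relation (hash key) */
        BlockNumber     content_size;   /* blocks known to hold data */
        BlockNumber     physical_size;  /* blocks physically allocated */
    } RelSizeEntry;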

> To address the RelationGetBufferForTuple() case, vacuum would simply not scan
> beyond the "content size", so it'd never encounter the page that
> RelationGetBufferForTuple() is currently dealing with.  And we'd not need to
> add bulk extended pages into the FSM.
>
> This would also, I think, be the basis for teaching vacuum to truncate
> relations without acquiring an AEL - which IME causes a lot of operational
> issues.

I have said the same exact thing myself at least once. Again, it's a
question of marrying high level and low level information. That is key
here.
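
Continuing the hypothetical RelSizeEntry sketch from above, the VACUUM
side might then look something like this:

    /*
     * Scan only up to the "content size"; freshly bulk-extended pages
     * beyond it are never visited, so there is nothing to race
     * against.
     */
    BlockNumber scan_limit = Min(RelationGetNumberOfBlocks(rel),
                                 entry->content_size);

    for (BlockNumber blkno = 0; blkno < scan_limit; blkno++)
    {
        /* ... the usual per-page lazy_scan_heap() work ... */
    }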

> I wonder what it'd take to make the FSM "more crashsafe". Leaving aside the
> cases of "just extended" new/empty pages: We already call
> XLogRecordPageWithFreeSpace() for HEAP2_VACUUM, HEAP2_VISIBLE,
> XLOG_HEAP2_PRUNE as well as insert/multi_insert/update. However, we don't set
> the LSN of FSM pages. Which means we'll potentially flush dirty FSM buffers to
> disk, before the corresponding WAL records make it to disk.

Part of the problem is that we remember the amount of free space in
each heap page with way too much granularity. That adds to the
contention problem, because backends fight it out in a mad dash to
locate minuscule amounts of free space. Moreover, if there were (say)
only 5 or 7 distinct increments of free space that the FSM could
represent for each heap page, then true crash safety would become a
lot cheaper.
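
As a hedged illustration of the granularity point (the real FSM uses
256 categories, via fsm_space_avail_to_cat() in freespace.c; the
6-bucket scheme below is invented):

    #define FSM_COARSE_BUCKETS 6

    static uint8
    free_space_to_bucket(Size avail)
    {
        Size    step = BLCKSZ / FSM_COARSE_BUCKETS;
        uint8   bucket = (uint8) (avail / step);

        /* clamp: never advertise more than the top bucket */
        return Min(bucket, FSM_COARSE_BUCKETS - 1);
    }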

I'll say it again: high level and low level information need to be combined.

> That'd still leave us with upper level corruption, but I guess we could just
> recompute those in some circumstances?

I think that we should just bite the bullet and come up with a way to
make it fully crash safe. No ifs, no buts.

--
Peter Geoghegan


