Re: should vacuum's first heap pass be read-only? - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: should vacuum's first heap pass be read-only?
Date
Msg-id CAH2-Wzkt9Ey9NNm7q9nSaw5jdBjVsAq3yvb4UT4M93UaJVd_xg@mail.gmail.com
In response to Re: should vacuum's first heap pass be read-only?  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: should vacuum's first heap pass be read-only?  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Mar 31, 2022 at 1:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I don't think we want to, exactly. Maybe it's easier to store
> > redundant TIDs than to avoid storing them in the first place (we can
> > probably just accept some redundancy). There is a trade-off to be made
> > there. I'm not at all sure of what the best trade-off is, though.
>
> My intuition is that storing redundant TIDs will turn out to be a very
> bad idea. I think that if we do that, the system will become prone to
> a new kind of vicious cycle (to try to put this in words similar to
> the ones you've been using to write about similar effects).

I don't feel very strongly about it either way. I definitely think
that there are workloads for which that will be true, and in general
I am no fan of putting off work that cannot possibly turn out to be
unnecessary in the end. I am in favor of "good laziness", not "bad
laziness".

> The more data we collect on the conveyor belt, the
> harder it's going to be when we eventually need to deduplicate.
>
> Also, I think it's possible to deduplicate at a very reasonable cost
> so long as (1) we enter each batch of TIDs into the conveyor belt as a
> distinguishable "run"

I definitely think that's the way to go, in general (regardless of
what we do about deduplicating TIDs).

> and (2) we never accumulate so many of these
> runs at the same time that we can't do a single merge pass to turn
> them into a sorted output stream. We're always going to discover dead
> TIDs in sorted order, so as we make our pass over the heap, we can
> simultaneously be doing an on-the-fly merge pass over the existing
> runs that are on the conveyor belt. That means we never need to store
> all the dead TIDs in memory at once.

That's a good idea IMV. I am vaguely reminded of an LSM tree.

> We just fetch enough data from
> each "run" to cover the block numbers for which we're performing the
> heap scan, and when the heap scan advances we throw away data for
> blocks that we've passed and fetch data for the blocks we're now
> reaching.
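
To make that concrete, here's a minimal standalone sketch of the
on-the-fly merge (plain C, no PostgreSQL internals; the Tid/TidRun
structs and merge_block() are just names I made up for illustration).
Each run on the conveyor belt is already sorted, so calling
merge_block() with ascending block numbers yields a single
deduplicated stream without ever holding all of the dead TIDs in
memory at once:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Standalone illustration only: a "TID" here is just (block, offset),
 * and each "run" is an already-sorted array, standing in for one batch
 * of dead TIDs pushed onto the conveyor belt.
 */
typedef struct
{
    uint32_t    block;
    uint16_t    offset;
} Tid;

typedef struct
{
    const Tid  *tids;
    int         ntids;
    int         pos;            /* merge cursor into this run */
} TidRun;

static int
tid_cmp(const Tid *a, const Tid *b)
{
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * Emit the deduplicated dead TIDs for one heap block, advancing each
 * run's cursor past that block.  Calling this with ascending block
 * numbers produces one merged, duplicate-free stream while only ever
 * looking at the TIDs for the block currently being scanned.
 */
static void
merge_block(TidRun *runs, int nruns, uint32_t block)
{
    Tid         last = {0, 0};
    bool        have_last = false;

    for (;;)
    {
        int         best = -1;

        /* pick the smallest unconsumed TID on the target block */
        for (int i = 0; i < nruns; i++)
        {
            TidRun     *r = &runs[i];

            if (r->pos < r->ntids && r->tids[r->pos].block == block &&
                (best < 0 || tid_cmp(&r->tids[r->pos],
                                     &runs[best].tids[runs[best].pos]) < 0))
                best = i;
        }
        if (best < 0)
            break;              /* no more TIDs on this block */

        Tid         next = runs[best].tids[runs[best].pos++];

        /* skip duplicates that appear in more than one run */
        if (!have_last || tid_cmp(&last, &next) != 0)
            printf("(%u,%u)\n", (unsigned) next.block,
                   (unsigned) next.offset);
        last = next;
        have_last = true;
    }
}

int
main(void)
{
    static const Tid run1[] = {{1, 3}, {1, 7}, {4, 2}};
    static const Tid run2[] = {{1, 7}, {2, 5}, {4, 2}, {4, 9}};
    TidRun      runs[] = {{run1, 3, 0}, {run2, 4, 0}};

    for (uint32_t blk = 1; blk <= 4; blk++)
        merge_block(runs, 2, blk);
    return 0;
}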

I wonder if you've thought about exploiting the new approach to
skipping pages using the visibility map from my patch series (which
you reviewed recently). I think that putting that in scope here could
be very helpful, both as a way of making the stuff we already want to
do with [global] indexes easier, and as independently useful work
based on the conveyor belt. The visibility map is very underexploited
as a source of accurate information about what VACUUM should be doing
IMV (in autovacuum.c's scheduling logic, but also in vacuumlazy.c
itself).

Imagine a world in which we decide up-front what pages we're going to
scan (i.e. our final vacrel->scanned_pages), by first scanning the
visibility map, and serializing it in local memory, or sometimes on
disk using the conveyor belt. Now you'll have a pretty good idea how
much TID deduplication will be necessary when you're done (to give one
example). In general "locking in" a plan of action for VACUUM seems
like it would be very useful. It will tend to cut down on the number
of "dead but not yet removable" tuples that VACUUM encounters right
now -- you at least avoid visiting concurrently modified heap pages
that were all-visible back when OldestXmin was first established.
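
Here's a rough standalone sketch of what I mean by deciding
scanned_pages up front (again plain C rather than real PostgreSQL
code; ScanRange and plan_scan_ranges() are made-up names, and the
visibility map is stood in for by a plain bool-per-block array). The
point is just that collapsing the not-all-visible blocks into ranges
before the heap pass begins gives you both a serialized "plan" and the
final scanned_pages number:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Standalone illustration only: "all_visible" is a bool-per-block array
 * standing in for the visibility map, and ScanRange is a made-up
 * serialization format for the pages VACUUM has decided to visit.
 */
typedef struct
{
    uint32_t    start;          /* first block to scan */
    uint32_t    nblocks;        /* number of consecutive blocks to scan */
} ScanRange;

/*
 * Decide the final set of pages to scan before the heap pass begins, by
 * collapsing the not-all-visible blocks into contiguous ranges.  The
 * result is what would be kept in local memory (or spilled to the
 * conveyor belt), and it also yields scanned_pages up front.
 */
static int
plan_scan_ranges(const bool *all_visible, uint32_t nblocks,
                 ScanRange *ranges, int maxranges, uint32_t *scanned_pages)
{
    int         nranges = 0;

    *scanned_pages = 0;
    for (uint32_t blk = 0; blk < nblocks; blk++)
    {
        if (all_visible[blk])
            continue;           /* skippable: don't visit this page */

        if (nranges > 0 &&
            ranges[nranges - 1].start + ranges[nranges - 1].nblocks == blk)
            ranges[nranges - 1].nblocks++;      /* extend current range */
        else if (nranges < maxranges)
            ranges[nranges++] = (ScanRange) {blk, 1};

        (*scanned_pages)++;
    }
    return nranges;
}

int
main(void)
{
    /* blocks 2-3 and 6 need scanning; everything else is all-visible */
    bool        vm[8] = {true, true, false, false, true, true, false, true};
    ScanRange   ranges[8];
    uint32_t    scanned_pages;
    int         n = plan_scan_ranges(vm, 8, ranges, 8, &scanned_pages);

    printf("scanned_pages = %u\n", (unsigned) scanned_pages);
    for (int i = 0; i < n; i++)
        printf("scan blocks %u..%u\n", (unsigned) ranges[i].start,
               (unsigned) (ranges[i].start + ranges[i].nblocks - 1));
    return 0;
}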

When all skippable ranges are known in advance, we can reorder things
in many different ways -- since the work of vacuuming can be
decomposed on the lazy_scan_heap side, too. The only ordering
dependency among heap pages (that I can think of offhand) is FSM
vacuuming, which seems like it could be addressed without great
difficulty.

Ideally everything can be processed in whatever order is convenient as
of right now, based on costs and benefits. We could totally decompose
the largest VACUUM operations into individually processable units of
work (heap and index units), so that individual autovacuum workers no
longer own particular VACUUM operations at all. Autovacuum workers
would instead multiplex all of the units of work from all pending
VACUUMs, scheduled centrally based on the current load of the system.
We can probably afford to be much more sensitive to the
number of pages we'll dirty relative to what the system can take right
now, and so on.
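
As a toy sketch of what such decomposed units of work might look like
(none of these names exist today, and the cost-based "policy" shown is
deliberately simplistic), something along these lines could let a
central scheduler hand individual heap ranges and index passes to
whichever autovacuum worker is idle:

#include <stdint.h>
#include <stdio.h>

/*
 * Toy illustration only: none of these names exist in PostgreSQL.  The
 * point is just that once heap ranges and index passes are independent
 * units, a central scheduler can hand them to any worker based on
 * current costs rather than tying a worker to one VACUUM operation.
 */
typedef enum
{
    VACUUM_UNIT_HEAP_RANGE,     /* prune/scan one range of heap blocks */
    VACUUM_UNIT_INDEX_PASS,     /* vacuum one index against a TID run */
    VACUUM_UNIT_HEAP_VACUUM     /* second heap pass over one range */
} VacuumUnitKind;

typedef struct
{
    uint32_t        reloid;     /* table this unit belongs to */
    VacuumUnitKind  kind;
    uint32_t        start_block;        /* for heap units */
    uint32_t        nblocks;
    int             index_num;  /* for index units */
    double          est_pages_dirtied;  /* input to cost-based scheduling */
} VacuumWorkUnit;

/*
 * Pick the next unit for an idle autovacuum worker.  Here the "policy"
 * is just "cheapest first"; a real scheduler would weigh wraparound
 * danger, current system load, the dirty-page budget, and so on.
 */
static int
pick_next_unit(const VacuumWorkUnit *queue, int nunits)
{
    int         best = -1;

    for (int i = 0; i < nunits; i++)
        if (best < 0 ||
            queue[i].est_pages_dirtied < queue[best].est_pages_dirtied)
            best = i;
    return best;
}

int
main(void)
{
    VacuumWorkUnit queue[] = {
        {16384, VACUUM_UNIT_HEAP_RANGE, 0, 1000, -1, 800.0},
        {16384, VACUUM_UNIT_INDEX_PASS, 0, 0, 0, 120.0},
        {24576, VACUUM_UNIT_HEAP_RANGE, 5000, 1000, -1, 50.0},
    };
    int         next = pick_next_unit(queue, 3);

    printf("next unit: table %u, kind %d\n",
           (unsigned) queue[next].reloid, (int) queue[next].kind);
    return 0;
}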

With this design, cancelling an autovacuum worker may just make the
system temporarily suspend the VACUUM operation for later (in fact
cancelling an autovacuum may not really be meaningful at all). A
design like this could also enable things like changing our mind about
advancing relfrozenxid: if we decided to skip all-visible pages in
this VACUUM operation, and come to regret that decision by the end
(because lots of XIDs are consumed in the interim), maybe we can go
back and get the all-visible pages we missed first time around.

I don't think that all of these ideas will turn out to be winners, but
I'm just trying to stimulate discussion. Breaking almost all
dependencies on the order that we process things in seems like it has
real promise.

> > It all depends, of course. The decision needs to be made using a cost
> > model. I suspect it will be necessary to try it out, and see.
>
> But having said that, coming back to this with fresh eyes, I think
> Dilip has a really good point here. If the first thing we do at the
> start of every VACUUM is scan the heap in a way that is guaranteed to
> rediscover all of the dead TIDs that we've previously added to the
> conveyor belt plus maybe also new ones, we may as well just forget the
> whole idea of having a conveyor belt at all.

I definitely agree that that's bad, and would be the inevitable result
of being lazy about deduplicating consistently.

> The only way the conveyor belt system has any
> value is if we think that there is some set of circumstances where the
> heap scan is separated in time from the index vacuum, such that we
> might sometimes do an index vacuum without having done a heap scan
> just before.

I agree.

-- 
Peter Geoghegan


