Re: decoupling table and index vacuum - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: decoupling table and index vacuum
Date
Msg-id CAH2-Wzkvirvn0vL_bjzcejdwJM+2QKFu31FPs73JKaF+-+AmyQ@mail.gmail.com
Whole thread Raw
In response to Re: decoupling table and index vacuum  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Fri, Apr 23, 2021 at 8:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 22, 2021 at 4:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Mostly what I'm saying is that I would like to put together a rough
> > list of things that we could do to improve VACUUM along the lines
> > we've discussed -- all of which stem from $SUBJECT. There are
> > literally dozens of goals (some of which are quite disparate) that we
> > could conceivably set out to pursue under the banner of $SUBJECT.
>
> I hope not. I don't have a clue why there would be dozens of possible
> goals here, or why it matters.

Not completely distinct goals, for the most part, but I can certainly
see dozens of benefits.

For example, if we know before index vacuuming starts that heap
vacuuming definitely won't go ahead (quite possible when we decide
that we're only vacuuming a subset of indexes), we can then tell the
index AM about that fact. It can then safely vacuum in an
"approximate" fashion, for example by skipping pages whose LSNs are
from before the last VACUUM, and by not bothering with a
super-exclusive lock in the case of nbtree.

The risk of a conflict between this goal and another goal that we may
want to pursue (which might be a bit contrived) is that we fail to do
the right thing when a large range deletion has taken place, which
must be accounted in the statistics, but creates a tension with the
global index stuff. It's probably only safe to do this when we know
that there have been hardly any DELETEs. There is also the question of
how the TID map thing interacts with the visibility map, and how that
affects how VACUUM behaves (both in general and in order to attain
some kind of specific new benefit from this synergy).

Who knows? We're never going to get on exactly the same page, but some
rough idea of which page each of us are on might save everybody time.

The stuff that I went into about making aborted transactions special
as a means of decoupling transaction status management from garbage
collection is arguably totally unrelated -- perhaps it's just too much
of a stretch to link that to what you want to do now. I suppose it's
hard to invest the time to engage with me on that stuff, and I
wouldn't be surprised if you never did so. If it doesn't happen it
would be understandable, though quite possibly a missed opportunity
for both of us. My basic intuition there is that it's another variety
of decoupling, so (for better or worse) it does *seem* related to me.
(I am an intuitive thinker, which has advantages and disadvantages.)

> I think if we're going to do something
> like $SUBJECT, we should just concentrate on the best way to make that
> particular change happen with minimal change to anything else.
> Otherwise, we risk conflating this engineering effort with others that
> really should be separate endeavors.

Of course it's true that that is a risk. That doesn't mean that the
opposite risk is not also a concern. I am concerned about both risks.
I'm not sure which risk I should be more concerned about.

I agree that we ought to focus on a select few goals as part of the
first round of work in this area (without necessarily doing all or
even most of them at the same time). It's not self-evident which goals
those should be at this point, though. You've said that you're
interested in global indexes. Okay, that's a start. I'll add the basic
idea of not doing index vacuuming for some indexes and not others to
the list -- this will necessitate that we teach index AMs to assess
how much bloat the index has accumulated since the last VACUUM, which
presumably must work in some generalized, composable way.

> For example, as far as possible,
> I think it would be best to try to do this without changing the
> statistics that are currently gathered, and just make the best
> decisions we can with the information we already have.

I have no idea if that's the right way to do it. In any case the
statistics that we gather influence the behavior of autovacuum.c, but
nothing stops us from doing our own dynamic gathering of statistics to
decide what we should do within vacuumlazy.c each time. We don't have
to change the basic triggering conditions to change the work each
VACUUM performs.

As I've said before, I think that we're likely to get more benefit (at
least at first) from making the actual reality of what VACUUM does
simpler and more predictable in practice than we are from changing how
reality is modeled inside autovacuum.c. I'll go further with that now:
if we do change that modelling at some point, I think that it should
work in an additive way, which can probably be compatible with how the
statistics and so on work already. For example, maybe vacuumlazy.c
asks autovacuum.c to do a VACUUM earlier next time. This can be
structured as an exception to the general rule of autovacuum
scheduling, probably -- something that occurs when it becomes evident
that the generic schedule isn't quite cutting it in some important,
specific way.

> Ideally, I'd
> like to avoid introducing a new kind of relation fork that uses a
> different on-disk storage format (e.g. 16MB segments that are dropped
> from the tail) rather than the one used by the other forks, but I'm
> not sure we can get away with that, because conveyor-belt storage
> looks pretty appealing here.

No opinion on that just yet.

> Regardless, the more we have to change to
> accomplish the immediate goal, the more likely we are to introduce
> instability into places where it could have been avoided, or to get
> tangled up in endless bikeshedding.

Certainly true. I'm not really trying to convince you of specific
actionable points just yet, though. Perhaps that was the problem (or
perhaps it simply led to miscommunication). It would be so much easier
to discuss some of this stuff at an event like pgCon. Oh well.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: [PATCH] Remove extraneous whitespace in tags: > foo< and >bar <
Next
From: Peter Geoghegan
Date:
Subject: Re: decoupling table and index vacuum