Re: decoupling table and index vacuum - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: decoupling table and index vacuum
Date
Msg-id CAH2-Wzm2LFd=1v3JxXL8d0SHhMSXdyoJRQO0tn0H2iT5pzC_ug@mail.gmail.com
In response to Re: decoupling table and index vacuum  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: decoupling table and index vacuum  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Apr 22, 2021 at 12:27 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I agree strongly with this. In fact, I seem to remember saying similar
> things to you in the past. If something wins $1 in 90% of cases and
> loses $5 in 10% of cases, is it a good idea? Well, it depends on how
> the losses are distributed. If every user can be expected to hit both
> winning and losing cases with approximately those frequencies, then
> yes, it's a good idea, because everyone will come out ahead on
> average. But if 90% of users will see only wins and 10% of users will
> see only losses, it sucks.

Right. It's essential that we not disadvantage any workload by more
than a small fixed amount (and only with a huge upside elsewhere).

The even more general version is this: the average probably doesn't
even exist in any meaningful sense.
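
To make the quoted $1/$5 arithmetic concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the ten "users", the function names, and the two scenarios are only there to illustrate the quoted figures, and none of it comes from PostgreSQL itself. It compares the mean payoff with the worst per-user payoff when the losses are spread evenly versus concentrated on one user in ten.

```python
# Hypothetical illustration of the quoted "$1 in 90% of cases, -$5 in 10%"
# example. None of these names exist in PostgreSQL.
WIN, LOSS = 1.0, -5.0
P_WIN = 0.9


def expected_payoff(p_win: float) -> float:
    """Mean payoff per operation for a user who wins with probability p_win."""
    return p_win * WIN + (1.0 - p_win) * LOSS


def summarize(per_user):
    """Return (mean payoff across users, worst single-user payoff)."""
    mean = sum(per_user) / len(per_user)
    return round(mean, 2), round(min(per_user), 2)


# Scenario A: every user sees the same 90/10 mix of wins and losses.
mixed_users = [expected_payoff(P_WIN)] * 10

# Scenario B: nine users only ever win, one user only ever loses.
split_users = [expected_payoff(1.0)] * 9 + [expected_payoff(0.0)]

print("mixed:", summarize(mixed_users))
print("split:", summarize(split_users))
```

Both scenarios have the same mean, but only the second contains a user who always loses $5. The mean alone cannot tell the two apart, which is the sense in which it fails to describe any actual workload.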

Bottom-up index deletion tends to be effective either 100% of the time
or 0% of the time, and which of the two applies varies from index to
index. Does that mean we should split the difference, and assume that
it's effective 50% of the time? Clearly not. That framing is just
wrong. And it basically doesn't matter whether it's half of all
indexes, or a quarter, or none: it can be any of those proportions,
and no aggregate figure says anything about a given index.

> That being said, I don't know what this really has to do with the
> proposal on the table, except in the most general sense. If you're
> just saying that decoupling stuff is good because different indexes
> have different needs, I am in agreement, as I said in my OP.

Mostly what I'm saying is that I would like to put together a rough
list of things that we could do to improve VACUUM along the lines
we've discussed -- all of which stem from $SUBJECT. There are
literally dozens of goals (some of which are quite disparate) that we
could conceivably set out to pursue under the banner of $SUBJECT.
Ideally there would be soft agreement about which ideas were more
promising. Ideally we'd avoid painting ourselves into a corner with
respect to one of these goals, in pursuit of any other goal.

I suspect that we'll need somewhat more of a top-down approach to this
work, which is something that we as a community don't have much
experience with. It might be useful to set the parameters of the
discussion up-front, which seems weird to me too, but might actually
help. (A lot of the current problems with VACUUM seem like they might
be consequences of pgsql-hackers not usually working like this.)

> It sort
> of sounded like you were saying that it's not important to try to
> estimate the number of undeleted dead tuples in each index, which
> puzzled me, because while knowing doesn't mean everything is
> wonderful, not knowing it sure seems worse. But I guess maybe that's
> not what you were saying, so I don't know.

I agree that it matters that we are able to characterize how bloated a
partial index is, because an improved VACUUM implementation will need
to know that. My main point about that was that it's complicated in
surprising ways that actually matter. An approximate solution seems
quite possible to me, but I think it will probably have to involve the
index AM directly.

Sometimes 10% - 30% of the extant physical index tuples will be dead
and it'll be totally fine in every practical sense -- the index won't
have grown by even one page since the last VACUUM! Other times it
might be as few as 2% - 5% that are now dead when VACUUM is
considered, which will in fact be a serious problem (e.g., because
the dead tuples are concentrated in one part of the keyspace). I would
say that having some rough idea of which case we have on our hands is
extremely important here, even if the distinction only arises in rare
cases (though FWIW I don't think that these differences will be rare
at all).
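
A toy model of that distinction, as a hedged sketch: it assumes an index can be described as a list of leaf pages, each with a live and a dead tuple count. The page layout, the counts, and the function names are all made up for illustration, and it says nothing about how an index AM would actually gather or report this. It only shows that the index-wide dead fraction hides how the dead tuples are distributed.

```python
# Hypothetical model: each leaf page is a (live_tuples, dead_tuples) pair.
# These structures and numbers are illustrative only, not PostgreSQL's.

def dead_fraction(pages):
    """Fraction of all index tuples that are dead, across the whole index."""
    live = sum(l for l, _ in pages)
    dead = sum(d for _, d in pages)
    return dead / (live + dead)


def worst_page_dead_fraction(pages):
    """Highest dead fraction found on any single non-empty page."""
    return max(d / (l + d) for l, d in pages if l + d > 0)


# Index A: 100 pages, 20% dead tuples on every page (evenly spread).
uniform = [(80, 20)] * 100

# Index B: 100 pages, only 4% dead overall, but all of it packed into
# 4 adjacent pages, i.e. one part of the keyspace.
skewed = [(100, 0)] * 96 + [(0, 100)] * 4

for name, pages in (("uniform", uniform), ("skewed", skewed)):
    print(name, dead_fraction(pages), worst_page_dead_fraction(pages))
```

Judged by the index-wide number alone, the skewed index looks five times healthier than the uniform one, yet it is the one with pages that are entirely dead. A per-page or per-key-range measure is one possible way to tell the two cases apart; whether that is the right signal for VACUUM is exactly the open question.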

(I also tried to clarify what I mean about qualitative bloat in
passing in my response about the case of a bloated partial index.)

> I feel like we're in danger
> of drifting into discussions about whether we're disagreeing with each
> other rather than, as I would like, focusing on how to design a system
> for $SUBJECT.

While I am certainly guilty of being kind of hand-wavy and talking
about lots of stuff all at once here, it's still kind of unclear what
practical benefits you hope to attain through $SUBJECT, apart from the
thing about global indexes, which matters but is hardly the
overwhelming reason to do all this. I myself don't expect your goals
to be super crisp just yet. As I said, I'm happy to talk about it in
very general terms at first -- isn't that what you were doing
yourself?

Or did I misunderstand -- are global indexes mostly all that you're
thinking about here? (Even if they are all you care about, it still
seems like you're somewhat obligated to generalize the dead TID
fork/map thing to help with a bunch of other things, just to justify
the complexity of adding a dead TID relfork.)

--
Peter Geoghegan


