From: Robert Haas
Subject: Re: decoupling table and index vacuum
Date:
Msg-id: CA+TgmoYf4PZ-zu4jYmgbQcSj5HRjs5briGuCBSWbAv9-woXySA@mail.gmail.com
In response to: Re: decoupling table and index vacuum (Andres Freund <andres@anarazel.de>)
Responses: Re: decoupling table and index vacuum (Peter Geoghegan <pg@bowt.ie>)
           Re: decoupling table and index vacuum (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres@anarazel.de> wrote:
> I'm not sure that's the only way to deal with this. While some form of
> generic "conveyor belt" infrastructure would be a useful building block,
> and it'd be sensible to use it here if it existed, it seems feasible to
> store dead tids in a different way here. You could e.g. have per-heap-vacuum
> files with a header containing LSNs that indicate the age of the
> contents.

That's true, but I have some reservations about being overly reliant on
the filesystem to provide structure here. There are good reasons to be
worried about bloating the number of files in the data directory. Hmm,
but maybe we could mitigate that. First, we could skip this for small
relations. If you can vacuum the table and all of its indexes using
the naive algorithm in <10 seconds, you probably shouldn't do anything
fancy. That would *greatly* reduce the number of additional files
generated. Second, we could forget about treating them as separate
relation forks and make them some other kind of thing entirely, in a
separate directory, especially if we adopted Sawada-san's proposal to
skip WAL logging. I don't know if that proposal is actually a good
idea, because it effectively adds a performance penalty when you crash
or fail over, and that sort of thing can be an unpleasant surprise.
But it's something to think about.

> > This scheme adds a lot of complexity, which is a concern, but it seems
> > It's not completely independent: if you need to set some dead TIDs in
> > the table to unused, you may have to force index vacuuming that isn't
> > needed for bloat control. However, you only need to force it for
> > indexes that haven't been vacuumed recently enough for some other
> > reason, rather than every index.
>
> Hm - how would we know how recently that TID has been marked dead? We
> don't even have xids for dead ItemIds... Maybe you're intending to
> answer that in your next paragraph, but it's not obvious to me that'd be
> sufficient...

You wouldn't know anything about when things were added in terms of
wall clock time, but the idea was that TIDs get added in order and
stay in that order. So you know which ones were added first. Imagine a
conceptually infinite array of TIDs:

(17,5) (332,6) (5,1) (2153,92) ....

Each index keeps a pointer into this array. Initially it points to the
start of the array, here (17,5). If an index vacuum starts after
(17,5) and (332,6) have been added to the array but before (5,1) is
added, then upon completion it updates its pointer to point to (5,1).
If every index is pointing to (5,1) or some later element, then you
know that (17,5) and (332,6) can be set LP_UNUSED. If not, and you
want to get to a state where you CAN set (17,5) and (332,6) to
LP_UNUSED, you just need to force index vac on indexes that are
pointing to something prior to (5,1) -- and keep forcing it until
those pointers reach (5,1) or later.
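
To make that concrete, here's a rough sketch in C of the bookkeeping I
have in mind. Everything below (DeadTidStream and all the helper names)
is invented purely for illustration, not code that exists anywhere:

#include <stdbool.h>
#include <stdint.h>

/* One dead TID: (block number, line pointer offset), e.g. (17,5). */
typedef struct DeadTid
{
    uint32_t    blkno;
    uint16_t    offnum;
} DeadTid;

/* Append-only stream of dead TIDs, plus one cursor per index. */
typedef struct DeadTidStream
{
    DeadTid    *tids;          /* conceptually infinite, append-only */
    uint64_t    ntids;         /* number of entries appended so far */
    uint64_t   *index_cursor;  /* per index: first entry not yet removed */
    int         nindexes;
} DeadTidStream;

/*
 * Everything before the minimum of the per-index cursors has been removed
 * from every index, so those heap line pointers can be set LP_UNUSED.
 */
static uint64_t
reclaimable_horizon(const DeadTidStream *s)
{
    uint64_t    min = s->ntids;

    for (int i = 0; i < s->nindexes; i++)
    {
        if (s->index_cursor[i] < min)
            min = s->index_cursor[i];
    }
    return min;                 /* entries [0, min) can be set LP_UNUSED */
}

/*
 * To reclaim everything before "target", only the indexes whose cursor
 * still lags behind it need a forced vacuum pass.
 */
static bool
index_needs_forced_vacuum(const DeadTidStream *s, int idx, uint64_t target)
{
    return s->index_cursor[idx] < target;
}

/* After an index vacuum that consumed entries up to "upto", advance. */
static void
index_vacuum_completed(DeadTidStream *s, int idx, uint64_t upto)
{
    if (upto > s->index_cursor[idx])
        s->index_cursor[idx] = upto;
}

In the example above, once every cursor is at or past the position of
(5,1), reclaimable_horizon() covers (17,5) and (332,6); if a couple of
indexes are still behind, only those need the forced pass.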

> One thing that you didn't mention so far is that this'd allow us to add
> dead TIDs to the "dead tid" file outside of vacuum too. In some
> workloads most of the dead tuple removal happens as part of on-access
> HOT pruning. While some indexes are likely to see that via the
> killtuples logic, others may not. Being able to have more aggressive
> index vacuum for the one or two bloated indexes, without needing to rescan
> the heap, seems like it'd be a significant improvement.

Oh, that's a very interesting idea. It does impose some additional
requirements on any such system, though, because it means you have to
be able to efficiently add single TIDs. For example, you mention a
per-heap-VACUUM file above, but you can't get away with creating a new
file per HOT prune no matter how you arrange things at the FS level.
Actually, though, I think the big problem here is deduplication. A
full-blown VACUUM can perhaps read all the already-known-to-be-dead
TIDs into some kind of data structure and avoid re-adding them, but
that's impractical for a HOT prune.
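
Continuing the invented sketch from above, the HOT prune side would want
something like this, with no duplicate check at all (again, purely
illustrative):

static void
dead_tid_append(DeadTidStream *s, uint32_t blkno, uint16_t offnum)
{
    /*
     * An on-access HOT prune needs this to be cheap: append one TID and
     * move on.  Nothing here prevents the same TID from being appended
     * twice.  A full VACUUM could afford to load the existing entries
     * into a hash table and skip duplicates; a HOT prune really can't.
     */
    s->tids[s->ntids].blkno = blkno;    /* assume the caller made room */
    s->tids[s->ntids].offnum = offnum;
    s->ntids++;
}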

> Have you thought about how we would do the scheduling of vacuums for the
> different indexes? We don't really have useful stats for the number of
> dead index entries to be expected in an index. It'd not be hard to track
> how many entries are removed in an index via killtuples, but
> e.g. estimating how many dead entries there are in a partial index seems
> quite hard (at least without introducing significant overhead).

No, I don't have any good ideas about that, really. Partial indexes
seem like a hard problem, and so do GIN indexes or other kinds of
things where you may have multiple index entries per heap tuple. We
might have to accept some known-to-be-wrong approximations in such
cases.

> > One rather serious objection to this whole line of attack is that we'd
> > ideally like VACUUM to reclaim disk space without using any more, in
> > case that's the motivation for running VACUUM in the first place.
>
> I suspect we'd need a global limit of space used for this data. If above
> that limit we'd switch to immediately performing the work required to
> remove some of that space.

I think that's entirely the wrong approach. On the one hand, it
doesn't prevent you from running out of disk space during emergency
maintenance, because the disk overall can be full even though you're
below your quota of space for this particular purpose. On the other
hand, it does subject you to random breakage when your database gets
big enough that the critical information can't be stored within the
configured quota. I think we'd end up with pathological cases very
much like what used to happen with the fixed-size free space map. What
happened there was that your database got big enough that you couldn't
track all the free space any more and it just started bloating out the
wazoo. What would happen here is that you'd silently lose the
well-optimized version of VACUUM when your database gets too big. That
does not seem like something anybody wants.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


