Re: decoupling table and index vacuum - Mailing list pgsql-hackers

From Andres Freund
Subject Re: decoupling table and index vacuum
Date
Msg-id 20210422184400.ebnoe6gu42b4a67h@alap3.anarazel.de
In response to Re: decoupling table and index vacuum  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: decoupling table and index vacuum  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Hi,

On 2021-04-22 11:30:21 -0700, Peter Geoghegan wrote:
> I think that you're both missing very important subtleties here.
> Apparently the "quantitative vs qualitative" distinction I like to
> make hasn't cleared it up.

I'm honestly getting a bit annoyed about this stuff. Yes, it's a cool
improvement, but no, it doesn't mean that there aren't still relevant
issues in important cases. It doesn't help that you repeatedly imply
that people who don't see it your way need to have their view "cleared
up".

"Bottom up index deletion" is practically *irrelevant* for a significant
set of workloads.


> You both seem to be assuming that everything would be fine if you
> could somehow inexpensively know the total number of undeleted dead
> tuples in each index at all times.

I don't think we'd need an exact number. Just a reasonable approximation
so we know whether it's worth spending time vacuuming some index.


> But I don't think that that's true at all. I don't mean that it might
> not be true. What I mean is that it's usually a meaningless number *on
> its own*, at least if you assume that every index is either an nbtree
> index (or an index that uses some other index AM that has the same
> index deletion capabilities).

You also have to assume that you have roughly evenly distributed index
insertions and deletions. But workloads that insert into some parts of a
value range and delete from another range are common.

I would even say that *precisely* because "Bottom up index deletion" can
be very efficient in some workloads, it is useful to have per-index stats
for deciding whether an index should be vacuumed or not.
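
Purely as an illustration of what such a per-index decision might look
like (this is not code from the thread or from PostgreSQL; the names
IndexStats, should_vacuum_index and the 5% threshold are hypothetical),
a minimal sketch that only needs approximate counts:

```python
from dataclasses import dataclass


@dataclass
class IndexStats:
    """Approximate per-index counters (hypothetical; not a PostgreSQL structure)."""
    name: str
    est_live_tuples: int   # rough count of live index tuples
    est_dead_tuples: int   # rough count of dead-but-undeleted index tuples


def should_vacuum_index(stats: IndexStats,
                        dead_fraction_threshold: float = 0.05) -> bool:
    """Return True if the estimated dead fraction reaches the threshold.

    The threshold is an arbitrary illustrative value, not a tuned default.
    """
    total = stats.est_live_tuples + stats.est_dead_tuples
    if total == 0:
        # Empty index: nothing worth vacuuming.
        return False
    return stats.est_dead_tuples / total >= dead_fraction_threshold


def indexes_to_vacuum(all_stats, dead_fraction_threshold: float = 0.05):
    """Names of the indexes whose estimates say a vacuum pass is worthwhile."""
    return [s.name for s in all_stats
            if should_vacuum_index(s, dead_fraction_threshold)]
```

With estimates like these, an index that bottom-up deletion keeps clean
would be skipped, while one that only sees deletes in some value range
would still be picked up.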


> My mental models for index bloat usually involve imagining an
> idealized version of a real world bloated index -- I compare the
> empirical reality against an imagined idealized version. I then try to
> find optimizations that make the reality approximate the idealized
> version. Say a version of the same index in a traditional 2PL database
> without MVCC, or in real world Postgres with VACUUM that magically
> runs infinitely fast.
> 
> Bottom-up index deletion usually leaves a huge number of
> undeleted-though-dead index tuples untouched for hours, even when it
> works perfectly. 10% - 30% of the index tuples might be
> undeleted-though-dead at any given point in time (traditional B-Tree
> space utilization math generally ensures that there is about that much
> free space on each leaf page if we imagine no version churn/bloat --
> we *naturally* have a lot of free space to work with). These are
> "Schrodinger's dead index tuples". You could count them
> mechanistically, but then you'd be counting index tuples that are
> "already dead and deleted" in an important theoretical sense, despite
> the fact that they are not yet literally deleted. That's why bottom-up
> index deletion frequently avoids 100% of all unnecessary page splits.
> The asymmetry that was there all along was just crazy. I merely had
> the realization that it was there and could be exploited -- I didn't
> create or invent the natural asymmetry.

Except that heap bloat, not index bloat, might be the more pressing
concern. Or that there will be no meaningful amount of bottom-up
deletions. Or ...

Greetings,

Andres Freund


