Re: 64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: 64-bit XIDs in deleted nbtree pages
Date
Msg-id CAH2-WznRT7aoZvuSKarXJmRdT_RmaQOku4p3AiK1fm8NKwggXA@mail.gmail.com
Whole thread Raw
In response to Re: 64-bit XIDs in deleted nbtree pages  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On Tue, Feb 9, 2021 at 11:58 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Thanks for picking this up!

I actually had a patch for this in 2019, albeit one that remained in
rough shape until recently. Must have forgotten about it.

> Is it really worth the trouble to maintain 'level' on deleted pages? All
> you currently do with it is check that the BTP_LEAF flag is set iff
> "level == 0", which seems pointless. I guess there could be some
> forensic value in keeping 'level', but meh.

What trouble is that? The only way in which it's inconvenient is that
we have to include the level field in xl_btree_unlink_page WAL records
for the first time. The structure of the relevant REDO routine (which
is called btree_xlog_unlink_page()) ought to explicitly recreate the
original page from scratch, without any special cases. This makes it
possible to pretend that there never was such a thing as an nbtree
page whose level field could not be relied on. I personally think that
it's simpler when seen in the wider context of how the code works and
is verified.

Besides, there is also amcheck to consider. I am a big believer in
amcheck, and see it as something that has enabled my work on the
B-Tree code over the past few years. Preserving the level field in
deleted pages increases our coverage just a little, and practically
eliminates cases where we cannot rely on the level field.

Of course it's still true that this detail (the deleted pages level
field question) will probably never seem important to anybody else. To
me it's one small detail of a broader strategy. No one detail of that
broader strategy, taken in isolation, will ever be crucially
important.

Of course it's also true that we should not assume that a very high
cost in performance/code/whatever can justify a much smaller benefit
in amcheck. But you haven't really explained why the cost seems
unacceptable to you. (Perhaps I missed something.)

> How about:
>
> INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
> DETAIL:  250000 index row versions and 686 pages were removed.
> 2056 index pages are now unused, 1370 are currently reusable.
>
> The idea is that the first DETAIL line now says what the VACUUM did this
> round, and the last line says what the state of the index is now. One
> concern with that phrasing is that it might not be clear what "686 pages
> were removed" means.

> It's still a bit weird that the "what VACUUM did this round" information
> is sandwiched between the two other lines that talk about the state of
> the index after the operation. But I think the language now makes it
> more clear which is which.

IMV our immediate goal for the new VACUUM VERBOSE output should be to
make the output as accurate and descriptive as possible (while still
using terminology that works for all index AMs, not just nbtree). I
don't think that we should give too much weight to making the
information easy to understand in isolation. Because that's basically
impossible -- it just doesn't work that way IME.

Confusion over the accounting of "deleted pages in indexes" vs "pages
deleted by this VACUUM" is not new. See my bugfix commit 73a076b0 to
see one vintage example. The relevant output of VACUUM VERBOSE
produced inconsistent results for perhaps as long as 15 years before I
noticed it and fixed it. I somehow didn't notice this despite using it
for various tests for my own B-Tree projects a year or two before the
fix. Tests that produced inconsistent results that I noticed pretty
early on, and yet assumed were all down to some subtlety that I didn't
yet understand.

My point is this: I am quite prepared to admit that these details
really are complicated. But that's not incidental to what's really
going on, or anything (though I agree with your later remarks on the
general tidiness of VACUUM VERBOSE -- it is a real dog's dinner).

I'm not saying that we should assume that no DBA will find the
relevant VACUUM VERBOSE output useful -- I don't think that at all. It
will be kind of rare for a user to really comb through it. But that's
mostly because big problems in this area are themselves kind of rare
(most individual indexes never have any deleted pages IME).

Any DBA consuming this output sensibly will consume it in a way that
makes sense in the *context of the problem that they're experiencing*,
whatever that might mean for them. They'll consider how it changes
over time for the same index. They'll try to correlate it with other
symptoms, or other problems, and make sense of it in a top-down
fashion. We should try to make it as descriptive as possible so that
DBAs will have the breadcrumbs they need to tie it back to whatever
the core issue happens to be -- maybe they'll have to read the source
code to get to the bottom of it. It's likely to be some rare issue in
those cases where the DBA really cares about the details -- it's
likely to be workload dependent.

Good DBAs spend much of their time on exceptional problems -- all the
easy problems will have been automated away already. Things like wait
events are popular with DBAs for this reason.

> Or perhaps flip the INFO and first DETAIL
> lines around like this:

> INFO: 250000 index row versions and 686 pages were removed from index
> "foo_pkey"
> DETAIL: index now contains 250001 row versions in 2745 pages.
> 2056 index pages are now unused, of which 1370 are currently reusable.
>
> For context, the more full message you get on master is:

> That's pretty confusing, it's a mix of basically progress indicators
> (vacuuming "public.foo"), CPU measurements, information about what was
> removed, and what the state is afterwards.

I agree that the output of VACUUM VERBOSE is messy. It's probably a
bunch of accretions that made sense in isolation, but added up to a
big mess over time. So I agree: now would be a good time to do
something about that.

It would also be nice to find a way to get this information in the
logs when log_autovacuum is enabled (perhaps only when the verbosity
is increased). I've discussed this with Masahiko in the context of his
recent work, actually. Even before we started talking about the XID
page deletion problem that I'm fixing here.

> INFO: "foo_pkey": removed 250000 index row versions and 686 pages
> DETAIL: index now contains 250001 row versions in 2745 pages.
> 2056 index pages are now unused, of which 1370 are currently reusable.

I can see what you mean here, and maybe we should do roughly what
you've outlined. Still, we should use terminology that isn't too far
removed from what actually happens in nbtree. What's a "removed" page?
The distinction between all of the different kinds of index pages that
might be involved here is just subtle. Again, better to use a precise,
descriptive term that nobody fully understands -- because hardly
anybody will fully understand it anyway (even including advanced users
that go on to find the VACUUM VERBOSE output very useful for whatever
reason).

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: "Bossart, Nathan"
Date:
Subject: Re: partial heap only tuples
Next
From: Andy Fan
Date:
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey