Re: 64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: 64-bit XIDs in deleted nbtree pages |
Date | |
Msg-id | CAH2-WznRT7aoZvuSKarXJmRdT_RmaQOku4p3AiK1fm8NKwggXA@mail.gmail.com Whole thread Raw |
In response to | Re: 64-bit XIDs in deleted nbtree pages (Heikki Linnakangas <hlinnaka@iki.fi>) |
List | pgsql-hackers |
On Tue, Feb 9, 2021 at 11:58 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Thanks for picking this up! I actually had a patch for this in 2019, albeit one that remained in rough shape until recently. Must have forgotten about it. > Is it really worth the trouble to maintain 'level' on deleted pages? All > you currently do with it is check that the BTP_LEAF flag is set iff > "level == 0", which seems pointless. I guess there could be some > forensic value in keeping 'level', but meh. What trouble is that? The only way in which it's inconvenient is that we have to include the level field in xl_btree_unlink_page WAL records for the first time. The structure of the relevant REDO routine (which is called btree_xlog_unlink_page()) ought to explicitly recreate the original page from scratch, without any special cases. This makes it possible to pretend that there never was such a thing as an nbtree page whose level field could not be relied on. I personally think that it's simpler when seen in the wider context of how the code works and is verified. Besides, there is also amcheck to consider. I am a big believer in amcheck, and see it as something that has enabled my work on the B-Tree code over the past few years. Preserving the level field in deleted pages increases our coverage just a little, and practically eliminates cases where we cannot rely on the level field. Of course it's still true that this detail (the deleted pages level field question) will probably never seem important to anybody else. To me it's one small detail of a broader strategy. No one detail of that broader strategy, taken in isolation, will ever be crucially important. Of course it's also true that we should not assume that a very high cost in performance/code/whatever can justify a much smaller benefit in amcheck. But you haven't really explained why the cost seems unacceptable to you. (Perhaps I missed something.) > How about: > > INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages > DETAIL: 250000 index row versions and 686 pages were removed. > 2056 index pages are now unused, 1370 are currently reusable. > > The idea is that the first DETAIL line now says what the VACUUM did this > round, and the last line says what the state of the index is now. One > concern with that phrasing is that it might not be clear what "686 pages > were removed" means. > It's still a bit weird that the "what VACUUM did this round" information > is sandwiched between the two other lines that talk about the state of > the index after the operation. But I think the language now makes it > more clear which is which. IMV our immediate goal for the new VACUUM VERBOSE output should be to make the output as accurate and descriptive as possible (while still using terminology that works for all index AMs, not just nbtree). I don't think that we should give too much weight to making the information easy to understand in isolation. Because that's basically impossible -- it just doesn't work that way IME. Confusion over the accounting of "deleted pages in indexes" vs "pages deleted by this VACUUM" is not new. See my bugfix commit 73a076b0 to see one vintage example. The relevant output of VACUUM VERBOSE produced inconsistent results for perhaps as long as 15 years before I noticed it and fixed it. I somehow didn't notice this despite using it for various tests for my own B-Tree projects a year or two before the fix. Tests that produced inconsistent results that I noticed pretty early on, and yet assumed were all down to some subtlety that I didn't yet understand. My point is this: I am quite prepared to admit that these details really are complicated. But that's not incidental to what's really going on, or anything (though I agree with your later remarks on the general tidiness of VACUUM VERBOSE -- it is a real dog's dinner). I'm not saying that we should assume that no DBA will find the relevant VACUUM VERBOSE output useful -- I don't think that at all. It will be kind of rare for a user to really comb through it. But that's mostly because big problems in this area are themselves kind of rare (most individual indexes never have any deleted pages IME). Any DBA consuming this output sensibly will consume it in a way that makes sense in the *context of the problem that they're experiencing*, whatever that might mean for them. They'll consider how it changes over time for the same index. They'll try to correlate it with other symptoms, or other problems, and make sense of it in a top-down fashion. We should try to make it as descriptive as possible so that DBAs will have the breadcrumbs they need to tie it back to whatever the core issue happens to be -- maybe they'll have to read the source code to get to the bottom of it. It's likely to be some rare issue in those cases where the DBA really cares about the details -- it's likely to be workload dependent. Good DBAs spend much of their time on exceptional problems -- all the easy problems will have been automated away already. Things like wait events are popular with DBAs for this reason. > Or perhaps flip the INFO and first DETAIL > lines around like this: > INFO: 250000 index row versions and 686 pages were removed from index > "foo_pkey" > DETAIL: index now contains 250001 row versions in 2745 pages. > 2056 index pages are now unused, of which 1370 are currently reusable. > > For context, the more full message you get on master is: > That's pretty confusing, it's a mix of basically progress indicators > (vacuuming "public.foo"), CPU measurements, information about what was > removed, and what the state is afterwards. I agree that the output of VACUUM VERBOSE is messy. It's probably a bunch of accretions that made sense in isolation, but added up to a big mess over time. So I agree: now would be a good time to do something about that. It would also be nice to find a way to get this information in the logs when log_autovacuum is enabled (perhaps only when the verbosity is increased). I've discussed this with Masahiko in the context of his recent work, actually. Even before we started talking about the XID page deletion problem that I'm fixing here. > INFO: "foo_pkey": removed 250000 index row versions and 686 pages > DETAIL: index now contains 250001 row versions in 2745 pages. > 2056 index pages are now unused, of which 1370 are currently reusable. I can see what you mean here, and maybe we should do roughly what you've outlined. Still, we should use terminology that isn't too far removed from what actually happens in nbtree. What's a "removed" page? The distinction between all of the different kinds of index pages that might be involved here is just subtle. Again, better to use a precise, descriptive term that nobody fully understands -- because hardly anybody will fully understand it anyway (even including advanced users that go on to find the VACUUM VERBOSE output very useful for whatever reason). -- Peter Geoghegan
pgsql-hackers by date: