Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing - Mailing list pgsql-hackers

From samay sharma
Subject Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing
Date
Msg-id CAJxrbyxgWq3KZ3ExLbQ_S-H8zT8md5kU5=RD2bqV4LCo8chQ9A@mail.gmail.com
Whole thread Raw
In response to Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Hi,

On Wed, May 3, 2023 at 2:59 PM Peter Geoghegan <pg@bowt.ie> wrote:
Hi Samay,

On Tue, May 2, 2023 at 11:40 PM samay sharma <smilingsamay@gmail.com> wrote:
> Thanks for taking the time to do this. It is indeed difficult work.

Thanks for the review! I think that this is something that would
definitely benefit from a perspective such as yours.

Glad to hear that my feedback was helpful.
 

> There are things I like about the changes you've proposed and some where I feel that the previous section was easier to understand.

That makes sense, and I think that I agree with every point you've
raised, bar none. I'm pleased to see that you basically agree with the
high level direction.

I would estimate that the version you looked at (v2) is perhaps 35%
complete. So some of the individual problems you noticed were a direct
consequence of the work just not being anywhere near complete. I'll
try to do a better job of tracking the relative maturity of each
commit/patch in each commit message, going forward.

Anything that falls under "25.2.1. Recovering Disk Space" is
particularly undeveloped in v2. The way that I broke that up into a
bunch of WARNINGs/NOTEs/TIPs was just a short term way of breaking it
up into pieces, so that the structure was very approximately what I
wanted. I actually think that the stuff about CLUSTER and VACUUM FULL
belongs in a completely different chapter. Since it is not "Routine
Vacuuming" at all.

>> 2. Renamed "Preventing Transaction ID Wraparound Failures" to
>> "Freezing to manage the transaction ID space". Now we talk about
>> wraparound as a subtopic of freezing, not vice-versa. (This is a
>> complete rewrite, as described by later items in this list).
>
> +1 on this too. Freezing is a normal part of vacuuming and while the aggressive vacuums are different, I think just talking about the worst case scenario while referring to it is alarmist.

Strangely enough, Postgres 16 is the first version that instruments
freezing in its autovacuum log reports. I suspect that some long term
users will find it quite surprising to see how much (or how little)
freezing takes place in non-aggressive VACUUMs.

The introduction of page-level freezing will make it easier and more
natural to tune settings like vacuum_freeze_min_age, with the aim of
smoothing out the burden of freezing over time (particularly by making
non-aggressive VACUUMs freeze more). Page-level freezing removes any
question of not freezing every tuple on a page (barring cases where
"removable cutoff" is noticeably held back by an old MVCC snapshot).
This makes it more natural to think of freezing as a process that
makes it okay to store data in individual physical heap pages, long
term.

> 1) While I agree that bundling VACUUM and VACUUM FULL is not the right way, moving all VACUUM FULL references into tips and warnings also seems excessive. I think it's probably best to just have a single paragraph which talks about VACUUM FULL as I do think it should be mentioned in the reclaiming disk space section.

As I mentioned briefly already, my intention is to move it to another
chapter entirely. I was thinking of "Chapter 29. Monitoring Disk
Usage". The "Routine Vacuuming" docs would then link to this sect1 --
something along the lines of "non-routine commands to reclaim a lot of
disk space in the event of extreme bloat".

> 2) I felt that the new section, "Freezing to manage the transaction ID space" could be made simpler to understand. As an example, I understood what the parameters (autovacuum_freeze_max_age, vacuum_freeze_table_age) do and how they interact better in the previous version of the docs.

Agreed. I'm going to split it up some more. I think that the current
"25.2.2.1. VACUUM's Aggressive Strategy" should be split in two, so we
go from talking about aggressive VACUUMs to Antiwraparound
autovacuums. Finding the least confusing way of explaining it has been
a focus of mine in the last few days.

To be honest, this was not super simple to understand even in the previous version. However, as our goal is to simplify this and make it easier to understand, I'll hold this patch-set to a higher standard :).

I wish there was a simple representation (maybe even a table or something) which would explain the differences between a VACUUM which is not aggressive, a VACUUM which ends up being aggressive due to vacuum_freeze_table_age and an antiwraparound autovacuum.
 

> 4) I think we should explicitly call out that seeing an anti-wraparound VACUUM or "VACUUM table (to prevent wraparound)" is normal and that it's just a VACUUM triggered due to the table having unfrozen rows with an XID older than autovacuum_freeze_max_age. I've seen many users panicking on seeing this and feeling that they are close to a wraparound.

That has also been my exact experience. Users are terrified, usually
for no good reason at all. I'll make sure that this comes across in
the next revision of the patch series.

Thinking about it a bit more, I wonder if there's value in changing the "(to prevent wraparound)" to something else. It's understandable why people who just see that in pg_stat_activity and don't read docs might assume they are close to a wraparound.

Regards,
Samay
 

> Also, we should be more clear about how it's different from VACUUMs triggered due to the scale factors (cancellation behavior, being triggered when autovacuum is disabled etc.).

Right. Though I think that the biggest point of confusion for users is
how *few* differences there really are between antiwraparound
autovacuum, and any other kind of autovacuum that happens to use
VACUUM's aggressive strategy. There is really only one important
difference: the autocancellation behavior. This is an autovacuum
behavior, not a VACUUM behavior -- so the "VACUUM side" doesn't know
anything about that at all.  

> 5) Can we use a better name for the XidStopLimit mode? It seems like a very implementation centric name. Maybe a better version of "Running out of the XID space" or something like that?

Coming up with a new user-facing name for xidStopLimit is already on
my TODO list (it's surprisingly hard). I have used that name so far
because it unambiguously refers to the exact thing that I want to talk
about when discussing the worst case. Other than that, it's a terrible
name.

> 6) In the XidStopLimit mode section, it would be good to explain briefly why you could get to this scenario. It's not something which should happen in a normal running system unless you have a long running transaction or inactive replication slots or a badly configured system or something of that sort.

I agree that that's important. Note that there is already something
about "removable cutoff" being held back at the start of the
discussion of freezing -- that will prevent freezing in exactly the
same way as it prevents cleanup of dead tuples.

That will become a WARNING box in the next revision. There should also
be a similar, analogous WARNING box (about "removable cutoff" being
held back) much earlier on in the docs -- this should appear in
"25.2.1. Recovering Disk Space". Obviously this structure suggests
that there is an isomorphism between freezing and removing bloat. For
example, if you cannot "freeze" an XID that appears in some tuple's
xmax, then you also cannot remove that tuple because VACUUM only sees
it as a recently dead tuple (if xmax is >= OldestXmin/removable
cutoff, and from a deleter that already committed).

I don't think that we need to spell the "isomorphism" point out to the
reader directly, but having a subtle cue that that's how it works
seems like a good idea.

> If you got to this point, other than running VACUUM to get out of the situation, it's also important to figure out what got you there in the first place as many VACUUMs should have attempted to advance the relfrozenxid and failed.

It's also true that problems that can lead to the system entering
xidStopLimit mode aren't limited to cases where doing required
freezing is fundamentally impossible due to something holding back
"removable cutoff". It's also possible that VACUUM simply can't keep
up (though the failsafe has helped with that problem a lot).

I tend to agree that there needs to be more about this in the
xidStopLimit subsection (discussion of freezing being held back by
"removable cutoff" is insufficient), but FWIW that seems like it
should probably be treated as out of scope for this patch. It is more
the responsibility of the other patch [1] that aims to put the
xidStopLimit documentation on a better footing (and remove that
terrible HINT about single user mode).

Of course, that other patch is closely related to this patch -- the
precise boundaries are unclear at this point. In any case I think that
this should happen, because I think that it's a good idea.

> There are a few other small things I noticed along the way but my goal was to look at the overall structure.

Thanks again! This is very helpful.

[1] https://www.postgresql.org/message-id/flat/CAJ7c6TM2D277U2wH8X78kg8pH3tdUqebV3_JCJqAkYQFHCFzeg%40mail.gmail.com
--
Peter Geoghegan

pgsql-hackers by date:

Previous
From: "Jonathan S. Katz"
Date:
Subject: Re: pg_stat_io not tracking smgrwriteback() is confusing
Next
From: Tom Lane
Date:
Subject: Re: A bug with ExecCheckPermissions