Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation |
Date | |
Msg-id | CAH2-Wznsg-fp1vJR9_qLe6sRWqVVBiQsqt20xKwNBFqdLif84g@mail.gmail.com |
In response to | Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation |
List | pgsql-hackers |
On Mon, Jan 16, 2023 at 8:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> That looks completely magical from a user perspective. Some users
> aren't going to understand autovacuum behavior at all. Some will, and
> will be able to compare age(relfrozenxid) against
> autovacuum_freeze_max_age. Very few people are going to think to
> compare age(relfrozenxid) against some formula based on
> autovacuum_freeze_max_age. I guess if we document it, maybe they will.

What do you think of Andres' autovacuum_no_auto_cancel_age proposal?

As I've said several times already, I am by no means attached to the current formula.

> I do like the idea of driving the auto-cancel behavior off of the
> results of previous attempts to vacuum the table. That could be done
> independently of the XID age of the table.

Even when the XID age of the table has already significantly surpassed autovacuum_freeze_max_age, say due to autovacuum worker starvation?

> If we've failed to vacuum
> the table, say, 10 times, because we kept auto-cancelling, it's
> probably appropriate to force the issue.

I suggested 1000 times upthread. 10 times seems very low, at least if "number of times cancelled" is the sole criterion, without any attention paid to relfrozenxid age or some other tiebreaker.

> It doesn't really matter
> whether the autovacuum triggered because of bloat or because of XID
> age. Letting either of those things get out of control is bad.

While inventing a new no-auto-cancel behavior that prevents bloat from getting completely out of control may well have merit, I don't see why it needs to be attached to this other effort.

I think that the vast majority of individual tables have autovacuums cancelled approximately never, and so my immediate concern is ameliorating cases where not being able to auto-cancel once in a blue moon causes an outage.
Sure, the opposite problem also exists, and I think that it would be really bad if it was made significantly worse as an unintended consequence of a patch that addressed just the first problem. But that doesn't mean we have to solve both problems together at the same time.

> But at that point a lot of harm has already
> been done. In a frequently updated table, waiting 300 million XIDs to
> stop cancelling the vacuum is basically condemning the user to have to
> run VACUUM FULL. The table can easily be ten or a hundred times bigger
> than it should be by that point.

The rate at which relfrozenxid ages is just about useless as a proxy for how much wall clock time has passed with a given workload -- workloads are usually very bursty. It's much worse still as a proxy for what has changed in the table; completely static tables have their relfrozenxid age at exactly the same rate as the most frequently updated table in the same database (the table that "consumes the most XIDs"). So while antiwraparound autovacuum no-auto-cancel behavior may indeed save the user from problems with serious bloat, it will happen pretty much by mistake. Not that it doesn't happen all the same -- of course it does.

That factor (the mistake factor) doesn't mean I take the point any less seriously. What I don't take seriously is the idea that the precise XID age was ever crucially important.

More generally, I just don't accept that this leaves us with no room for something along the lines of my proposal, such as Andres' autovacuum_no_auto_cancel_age concept. As I've said already, there will usually be a very asymmetric quality to the problem in cases like the Joyent outage. Even a modest amount of additional XID-space headroom will very likely be all that is needed at the critical juncture. It may not be perfect, but it still has every potential to make things safer for some users, without making things any less safe for other users.

--
Peter Geoghegan
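
For readers following the thread, the two criteria debated above can be put side by side in a rough sketch. This is only an illustration in Python, not PostgreSQL code and not taken from any patch: the function names `no_auto_cancel_age` and `is_auto_cancellable` and the parameter `max_cancels` are hypothetical. The only inputs drawn from the message are the quoted formula Min(freeze_max_age * 2, 1 billion) and the suggested limit of 1000 cancellations.

```python
# Illustrative sketch only; names are hypothetical and this is not
# how PostgreSQL implements autovacuum cancellation.

ONE_BILLION = 1_000_000_000


def no_auto_cancel_age(freeze_max_age: int) -> int:
    """XID age past which auto-cancellation would stop, using the
    formula quoted in the thread: Min(freeze_max_age * 2, 1 billion)."""
    return min(freeze_max_age * 2, ONE_BILLION)


def is_auto_cancellable(relfrozenxid_age: int,
                        freeze_max_age: int,
                        times_cancelled: int,
                        max_cancels: int = 1000) -> bool:
    """Return True if a conflicting lock request may still cancel
    the autovacuum under the two criteria discussed in the thread."""
    # Criterion 1: the table's XID age has passed the derived threshold.
    if relfrozenxid_age >= no_auto_cancel_age(freeze_max_age):
        return False
    # Criterion 2: earlier attempts were cancelled too many times
    # (assumed limit of 1000, the figure suggested upthread).
    if times_cancelled >= max_cancels:
        return False
    return True
```

With the stock autovacuum_freeze_max_age of 200 million, the derived threshold in this sketch is 400 million XIDs, so a table at age 300 million would remain auto-cancellable unless the cancellation count had already reached the limit.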