Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation |
Date | |
Msg-id | CA+TgmoaFsgAAhVDq8N=nNdSA+EFLxM_bdF165LYAAvRr11bXmQ@mail.gmail.com |
In response to | Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation (Peter Geoghegan <pg@bowt.ie>) |
Responses | Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation |
List | pgsql-hackers |
On Mon, Jan 16, 2023 at 11:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jan 16, 2023 at 8:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> > That looks completely magical from a user perspective. Some users
> > aren't going to understand autovacuum behavior at all. Some will, and
> > will be able to compare age(relfrozenxid) against
> > autovacuum_freeze_max_age. Very few people are going to think to
> > compare age(relfrozenxid) against some formula based on
> > autovacuum_freeze_max_age. I guess if we document it, maybe they will.
>
> What do you think of Andres' autovacuum_no_auto_cancel_age proposal?

I like it better than your proposal. I don't think it's a fundamental
improvement and I would rather see a fundamental improvement, but I
can see it being better than nothing.

> > I do like the idea of driving the auto-cancel behavior off of the
> > results of previous attempts to vacuum the table. That could be done
> > independently of the XID age of the table.
>
> Even when the XID age of the table has already significantly surpassed
> autovacuum_freeze_max_age, say due to autovacuum worker starvation?
>
> > If we've failed to vacuum
> > the table, say, 10 times, because we kept auto-cancelling, it's
> > probably appropriate to force the issue.
>
> I suggested 1000 times upthread. 10 times seems very low, at least if
> "number of times cancelled" is the sole criterion, without any
> attention paid to relfrozenxid age or some other tiebreaker.

Hmm, I think that a threshold of 1000 is far too high to do much good.
By the time we've tried to vacuum a table 1000 times and failed every
time, I anticipate that the situation will be pretty dire, regardless
of why we thought the table needed to be vacuumed in the first place.
In the best case, with autovacuum_naptime=1min, failing 1000 times
means that we've delayed vacuuming the table for at least 16 hours.
That's assuming that there's a worker available to retry every minute
and that we fail quickly. If it's a 2 hour vacuum operation and we
typically fail about halfway through, it could take us over a month to
hit 1000 failures. There are many tables out there that get enough
inserts, updates, and deletes that a 16-hour delay will result in
irreversible bloat, never mind a 41-day delay. After even a few days,
wraparound could become critical even if bloat isn't.

I'm not sure why a threshold of 10 would be too low. It seems to me
that if we fail ten times in a row to vacuum a table and fail for the
same reason every time, we're probably going to keep failing for that
reason. If that is true, we will be better off if we force the issue
sooner rather than later. There's no value in letting the table bloat
out the wazoo and the cluster approach a wraparound shutdown before we
insist.

Consider a more mundane example. If I try to start my car or my
dishwasher or my computer or my toaster oven ten times and it fails
ten times in a row, and the failure mode appears to be the same each
time, I am not going to sit there and try 990 more times hoping things
get better, because that seems very unlikely to help. Honestly,
depending on the situation, I might not even get to ten times before I
switch to doing some form of troubleshooting and/or calling someone
who could repair the device.

In fact I think there's a decent argument that a threshold of ten is
possibly too high here, too. If you wait until the tenth try before
you try not auto-cancelling, then a table with a workload that makes
auto-cancelling 100% probable will get vacuumed 10% as often as it
would otherwise. I think there are cases where that would be OK, but
probably on the whole it's not going to go very well.

The only problem I see with lowering the threshold below ~10 is that
the signal starts to get weak. If something fails for the same reason
ten times in a row you can be pretty sure it's a chronic problem.
If you made the threshold say three you'd probably start making bad
decisions sometimes -- you'd think that you had a chronic problem when
really you just got a bit unlucky.

To get back to the earlier question above, I think that if the
retries-before-not-auto-cancelling threshold were low enough to be
effective, you wouldn't necessarily need to consider XID age as a
second reason for not auto-cancelling. You would want to force the
behavior anyway when you hit emergency mode, because that should force
all the mitigations we have, but I don't know that you need to do
anything before that.

> > It doesn't really matter
> > whether the autovacuum triggered because of bloat or because of XID
> > age. Letting either of those things get out of control is bad.
>
> While inventing a new no-auto-cancel behavior that prevents bloat from
> getting completely out of control may well have merit, I don't see why
> it needs to be attached to this other effort.

It doesn't, but I think it would be a lot more beneficial than just
adding a new GUC. A lot of the fundamental stupidity of autovacuum
comes from its inability to consider the context. I've had various
ideas over the years about how to fix that, but this is far simpler
than some things I've thought about and I think it would help a lot of
people.

> That factor (the mistake factor) doesn't mean I take the point any
> less seriously. What I don't take seriously is the idea that the
> precise XID age was ever crucially important.

I agree. That's why I think driving this off of number of previous
failures would be better than driving it off of an XID age.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
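
As a rough check on the figures in the message above, here is a minimal Python sketch of the delay arithmetic and of the "vacuumed 10% as often" claim. It is only an illustration under stated assumptions, not anything taken from the PostgreSQL source: the function names (`delay_hours`, `completed_vacuums`) are hypothetical, each failed attempt is assumed to cost a fixed amount of wall-clock time, a worker is assumed to always be free to retry, and the workload is assumed to cancel every cancellable attempt.

```python
# Back-of-the-envelope model of the numbers discussed in the message.
# All names here are hypothetical; none of this mirrors PostgreSQL internals.

MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24


def delay_hours(failures: int, minutes_per_attempt: float) -> float:
    """Hours elapsed after `failures` failed attempts, assuming each
    attempt costs `minutes_per_attempt` minutes of wall-clock time."""
    return failures * minutes_per_attempt / MINUTES_PER_HOUR


def completed_vacuums(attempts: int, threshold: int) -> int:
    """Count vacuums that finish when the workload cancels every
    cancellable attempt and the `threshold`-th attempt in a row is
    made non-cancellable (assumed policy, for illustration only)."""
    completed = 0
    consecutive_cancels = 0
    for _ in range(attempts):
        if consecutive_cancels < threshold - 1:
            # Still cancellable, and this workload always cancels it.
            consecutive_cancels += 1
        else:
            # Non-cancellable attempt: it runs to completion.
            completed += 1
            consecutive_cancels = 0
    return completed


# Best case: one retry per minute (autovacuum_naptime = 1min), instant failure.
best_case_hours = delay_hours(1000, 1)

# Slower case: a 2-hour vacuum cancelled about halfway, so ~60 minutes each.
slow_case_days = delay_hours(1000, 60) / HOURS_PER_DAY

# With a threshold of 10, one attempt in ten completes.
fraction_at_10 = completed_vacuums(100, 10) / 100

print(f"best case: {best_case_hours:.1f} hours")
print(f"slow case: {slow_case_days:.1f} days")
print(f"fraction completed at threshold 10: {fraction_at_10:.0%}")
```

Under these assumptions, 1000 failures work out to a bit under 17 hours in the best case and a bit under 42 days in the slower case, which lines up with the "at least 16 hours" and "41-day" figures in the message.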