Re: New IndexAM API controlling index vacuum strategies - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | Re: New IndexAM API controlling index vacuum strategies
Date |
Msg-id | 20210415033803.cc6mrp224ho4vea7@alap3.anarazel.de
In response to | Re: New IndexAM API controlling index vacuum strategies (Peter Geoghegan <pg@bowt.ie>)
Responses | Re: New IndexAM API controlling index vacuum strategies
List | pgsql-hackers
Hi,

On 2021-04-14 19:53:29 -0700, Peter Geoghegan wrote:
> > Or at least tests for it should be added (pg_resetwal +
> > autovacuum_naptime=1s or such should make that doable, or even just
> > running a small test with lower thresholds).
>
> You know what else doesn't have test coverage? Any kind of aggressive
> VACUUM. There is a problem with our culture around testing. I would
> like to address that in the scope of this project, but you know how it
> is. Can I take it that I'll have your support with adding those tests?

Sure!

> > I think there are good arguments for having logic for an "emergency
> > vacuum" mode (and also some good ones against). I'm not convinced that
> > the current set of things that are [not] skipped in failsafe mode is
> > the "obviously right set of things"™, but am convinced that there
> > wasn't enough consensus building on what that set of things is. This
> > all also would be different if it were the start of the development
> > window, rather than the end.
>
> I all but begged you to review the patches. Same with Robert. While
> the earlier patches (where almost all of the complexity is) did get
> review from both you and Robert (which I was grateful to receive), for
> whatever reason neither of you looked at the later patches in detail.

Based on a quick scan of the thread, the first version of a patch that
kind of resembles what got committed around the topic at hand seems to be
https://postgr.es/m/CAH2-Wzm7Y%3D_g3FjVHv7a85AfUbuSYdggDnEqN1hodVeOctL%2BOw%40mail.gmail.com
posted 2021-03-15. That's well into the last CF.

The reason I didn't do further reviews for things in this thread was that
I was trying really hard to get the shared memory stats patch into a
committable shape - there were just not enough hours in the day. I think
it's to be expected that, during the final CF, there aren't a lot of
resources for reviewing patches that are substantially new. Why should
these new patches have gotten priority over a much older patch set that
also addresses significant operational issues?

> > I think there's also a clear danger in having "cliffs" where the
> > behaviour changes abruptly once a certain threshold is reached. It's
> > not unlikely for systems to fall over entirely when
> >
> > a) autovacuum cost limiting is disabled. E.g. reaching your disk
> >    iops/throughput quota and barely being able to log into postgres
> >    anymore to kill the stuck connection causing the wraparound issue.
>
> Let me get this straight: You're concerned that hurrying up vacuuming
> when we have 500 million XIDs left to burn will overwhelm the system,
> which would presumably have finished in time otherwise?
>
> Even though it would have to do way more work in absolute terms in the
> absence of the failsafe? And even though the 1.6 billion XID age that
> we got to before the failsafe kicked in was clearly not enough? You'd
> want to "play it safe", and stick with the original plan at that point?

It's very common for larger / busier databases to *substantially*
increase autovacuum_freeze_max_age, so there won't be 1.6 billion XIDs
of headroom, but a few hundred million. The cost of doing unnecessary
anti-wraparound vacuums is just too great. And databases on the busier &
larger side of things are precisely the ones that are more likely to hit
wraparound issues (otherwise you're just not that likely to burn through
that many xids).
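To put rough numbers on that headroom argument, here is a minimal sketch of how the remaining margin could be checked, assuming the PostgreSQL 14 GUCs autovacuum_freeze_max_age and vacuum_failsafe_age; the exact names and the 1.6 billion default should be treated as assumptions, not as something quoted from the mail:

```sql
-- Sketch: per-database XID age versus the anti-wraparound trigger and the
-- failsafe cutoff (assumes the vacuum_failsafe_age GUC exists, i.e. PG 14+).
SELECT datname,
       age(datfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint
           - age(datfrozenxid) AS xids_until_antiwraparound,
       current_setting('vacuum_failsafe_age')::bigint
           - age(datfrozenxid) AS xids_until_failsafe
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```

With autovacuum_freeze_max_age raised to, say, 1.2 billion and the failsafe at its 1.6 billion default, only a few hundred million XIDs separate the normal anti-wraparound trigger from the point where cost limiting and index cleanup get skipped, which is the scenario being described.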
And my concern isn't really that vacuum would have finished without a
problem if cost limiting hadn't been disabled, but that having multiple
autovacuum workers going all out will cause problems. Like the system
slowing down so much that it's hard to fix the actual root cause of the
wraparound - I've seen systems where a bunch of unthrottled autovacuum
workers overwhelmed the IO subsystem so much that simply opening a
connection to fix the issue took 10+ minutes. Especially on systems with
provisioned IO (i.e. just about all cloud storage) that's not too hard
to hit.

> > b) No index cleanup happens anymore. E.g. a workload with a lot of
> >    bitmap index scans (which do not support killtuples) could end up
> >    a lot worse, because index pointers to dead tuples aren't being
> >    cleaned up. In cases where an old transaction or leftover
> >    replication slot is causing the problem (together a significant
> >    percentage of wraparound situations) this situation will persist
> >    across repeated (explicit or automatic) vacuums for a table,
> >    because relfrozenxid won't actually be advanced. And this in turn
> >    might actually end up slowing resolution of the wraparound issue
> >    more than doing all the index scans.
>
> If it's intrinsically impossible to advance relfrozenxid, then surely
> all bets are off. But even in this scenario it's very unlikely that we
> wouldn't at least do index vacuuming for those index tuples that are
> dead and safe to delete according to the OldestXmin cutoff. You still
> have 1.6 billion XIDs before the failsafe first kicks in, regardless
> of the issue of the OldestXmin/FreezeLimit being excessively far in
> the past.

As I said above, I don't think the "1.6 billion XIDs" argument has
merit, because it's so reasonable (and common) to set
autovacuum_freeze_max_age to something much larger.

> You're also not acknowledging the benefit of avoiding uselessly
> scanning the indexes again and again, which is mostly what would be
> happening in this scenario. Maybe VACUUM shouldn't spin like this at
> all, but that's not a new problem.

I explicitly said that there are benefits to skipping index scans?

Greetings,

Andres Freund
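For the scenario discussed above, where an old transaction or a leftover replication slot keeps relfrozenxid from advancing, the usual first step is to find what is holding back OldestXmin. A minimal sketch of that check follows; the column choices are illustrative assumptions, not something prescribed in the thread:

```sql
-- Sketch: the usual suspects that keep OldestXmin (and hence relfrozenxid)
-- from advancing, so repeated vacuums keep re-finding the same dead tuples.

-- Long-running transactions / old snapshots:
SELECT pid, state, xact_start,
       age(backend_xid)  AS xid_age,
       age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL OR backend_xmin IS NOT NULL
ORDER BY greatest(age(backend_xid), age(backend_xmin)) DESC;

-- Leftover replication slots pinning an old xmin:
SELECT slot_name, active,
       age(xmin)         AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots
ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC NULLS LAST;
```

Until whatever shows up here is resolved, neither throttled nor unthrottled vacuums can advance relfrozenxid, which is the point made about the failsafe not helping in that case.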