Re: New IndexAM API controlling index vacuum strategies - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: New IndexAM API controlling index vacuum strategies
Msg-id: 20210415015304.3svyy74uvwcbo7nc@alap3.anarazel.de
In response to: Re: New IndexAM API controlling index vacuum strategies (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: New IndexAM API controlling index vacuum strategies
List: pgsql-hackers
Hi,

On 2021-04-14 14:55:36 -0700, Peter Geoghegan wrote:
> On Wed, Apr 14, 2021 at 12:33 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm getting a bit bothered by the speed at which you're pushing fairly
> > substantial behavioural changes for vacuum. In this case without even a
> > warning that you're about to do so.
>
> To a large degree the failsafe is something that is written in the
> hope that it will never be needed.

This is unlike most other things, and has its own unique risks. Among them that the code is not covered by tests, and is unlikely to be meaningfully exercised within the beta timeframe because of how long it takes to reach the failsafe (hard to actually hit in under half a day even running extreme workloads, weeks for more realistic ones). Which means that this code has to be extra vigorously reviewed, not the opposite. Or at least tests for it should be added (pg_resetwal + autovacuum_naptime=1s or such should make that doable, or even just running a small test with lowered thresholds).

> I just went one step further than that in this recent commit. I didn't
> point these details out before now because (to me) this is beside the
> point. Which is that the failsafe is just that -- a failsafe. Anything
> that adds unnecessary unpredictable delay in reaching the point of
> advancing relfrozenxid should be avoided. (Besides, the design of
> should_attempt_truncation() and lazy_truncate_heap() is very far from
> guaranteeing that truncation will take place at the best of times.)

This line of argumentation scares me: unexplained arguments for running with behaviour that we otherwise don't run with, during exceptional circumstances. This code has a history of super subtle interactions, with quite a few data-loss-causing bugs due to us not foreseeing some combination of circumstances.

I think there are good arguments for having logic for an "emergency vacuum" mode (and also some good ones against).
I'm not convinced that the current set of things that are [not] skipped in failsafe mode is the "obviously right set of things"™, but I am convinced that there wasn't enough consensus building on what that set of things is. This all also would be different if it were the start of the development window, rather than the end.

In my experience the big problem with vacuums in a wraparound situation isn't actually things like truncation, or even the index scans (although they certainly can cause bad problems), but that VACUUM modifies (prunes/vacuums and WAL-logs, or just sets hint bits on) a crapton of pages that don't actually need to be modified just to be able to get out of the wraparound situation. And that the overhead of writing out all those dirty pages + WAL logging causes the VACUUM to take unacceptably long. E.g. because your storage is cloud storage with a few ms of latency, and the ringbuffer + wal_buffers sizes cause so many synchronous writes that you end up with < 10MB/s of data being processed.

I think there's also a clear danger in having "cliffs" where the behaviour changes abruptly once a certain threshold is reached. It's not unlikely for systems to fall over entirely when:

a) autovacuum cost limiting is disabled. E.g. reaching your disk iops/throughput quota and barely being able to log into postgres anymore to kill the stuck connection causing the wraparound issue.

b) no index cleanup happens anymore. E.g. a workload with a lot of bitmap index scans (which do not support killtuples) could end up a lot worse off because index pointers to dead tuples aren't being cleaned up. In cases where an old transaction or a leftover replication slot is causing the problem (together a significant percentage of wraparound situations), this situation will persist across repeated (explicit or automatic) vacuums of a table, because relfrozenxid won't actually be advanced.
And this in turn might actually end up slowing resolution of the wraparound issue more than doing all the index scans would. Because this is a hard cliff rather than something phasing in, it's not really possible for a user to see this slowly getting worse and address the issue.

Especially for a) this could be addressed by not turning off cost limiting all at once, but instead having it decrease the closer you get to some limit.

Greetings,

Andres Freund
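P.S. The "have cost limiting decrease gradually" idea could be sketched roughly as below. This is a hypothetical illustration in Python, not actual PostgreSQL code; the function name, the ramp-start fraction, and the specific numbers are all made up, and a real implementation would live in autovacuum's cost-delay logic:

```python
# Hypothetical sketch: instead of disabling the vacuum cost delay abruptly
# once a table's relfrozenxid age crosses the failsafe threshold, scale the
# delay down linearly as the age approaches that threshold.

def scaled_cost_delay(base_delay_ms, table_age, failsafe_age,
                      ramp_start_frac=0.5):
    """Return the cost delay to apply for a table of the given XID age.

    Full delay up to ramp_start_frac * failsafe_age, then a linear ramp
    down to zero delay at failsafe_age (and beyond).
    """
    ramp_start = failsafe_age * ramp_start_frac
    if table_age <= ramp_start:
        return base_delay_ms          # normal throttling
    if table_age >= failsafe_age:
        return 0.0                    # emergency: no throttling at all
    # Linear interpolation between full delay and no delay.
    frac_remaining = (failsafe_age - table_age) / (failsafe_age - ramp_start)
    return base_delay_ms * frac_remaining

# Illustrative values: 2ms base delay, 1.6 billion XID "failsafe" age.
print(scaled_cost_delay(2.0, 100_000_000, 1_600_000_000))    # → 2.0 (full delay)
print(scaled_cost_delay(2.0, 1_200_000_000, 1_600_000_000))  # → 1.0 (mid-ramp)
print(scaled_cost_delay(2.0, 1_700_000_000, 1_600_000_000))  # → 0.0 (past threshold)
```

With something like this, an operator watching vacuum progress would see throttling ease off progressively as the situation worsens, rather than the system's I/O behaviour changing all at once at the cliff.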