Thread: Corrupt index stopping autovacuum system-wide
Hello,
Over the weekend we noticed that our max transaction ID age has been continuously increasing - we got an alert when it passed 1B - and, from looking at pg_stat_user_tables, we believe that no autovacuums had run for a period of about a month. We had not updated any autovacuum tuning parameters over that time period, and many tables were well past the threshold for needing an autovacuum.
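Roughly how we checked (a sketch; the ordering and limits are just what we happened to use):

-- When did autovacuum last touch each table, and how many dead tuples are waiting?
SELECT relname, last_autovacuum, last_autoanalyze, n_dead_tup
FROM pg_stat_user_tables
ORDER BY last_autovacuum NULLS FIRST
LIMIT 20;

-- Per-database transaction ID age, to see how far each database is from wraparound.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;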
When investigating I located the table with the oldest transaction ID with:
SELECT c.oid::regclass AS table_name,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) AS age,
       pg_size_pretty(pg_table_size(c.oid)) AS table_size,
       c.relkind,
       c.relpersistence
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 't', 'm')
ORDER BY age DESC
LIMIT 40;
I vacuumed that table manually with `vacuum freeze verbose table_xx` and got this error:
INFO: aggressively vacuuming "public.table_xx"
INFO: scanned index "table_xx_pkey" to remove 168443 row versions
DETAIL: CPU: user: 0.13 s, system: 0.00 s, elapsed: 0.13 s
INFO: scanned index "table_xx_col_id" to remove 168443 row versions
DETAIL: CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
ERROR: failed to re-find parent key in index "table_xx_col_idx" for deletion target page 217
I replaced the index (created a new one concurrently, dropped the old one concurrently), vacuumed the table, and immediately autovacuums started across the system and our XIDs started falling. To me it looks like a *single* corrupt index held up autovacuums across our entire server, even in other databases on the same server. Am I interpreting this correctly? Would love guidance on diagnosing this type of thing and strategies for preventing it.
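For reference, the replacement was along these lines (a sketch with the names changed, not the exact commands we ran):

-- Build a replacement index without blocking writes, then swap it in.
CREATE INDEX CONCURRENTLY table_xx_col_idx_new ON table_xx (col);
DROP INDEX CONCURRENTLY table_xx_col_idx;
ALTER INDEX table_xx_col_idx_new RENAME TO table_xx_col_idx;
-- Re-run the freeze that previously failed.
VACUUM FREEZE VERBOSE table_xx;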
Thanks,
Aaron
On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz <aaronepelz@gmail.com> wrote:
> To me it looks like a *single* corrupt index held up autovacuums across our entire server, even in other databases on the same server. Am I interpreting this correctly?

Yes -- that is correct.

What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11?

> Would love guidance on diagnosing this type of thing and strategies for preventing it.

It's possible that amcheck would have given you an accurate diagnosis of the problem -- especially if you used bt_index_parent_check():

https://www.postgresql.org/docs/current/amcheck.html

--
Peter Geoghegan
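A minimal sketch of that check, assuming the amcheck extension is available (the index name is the one from the error above):

CREATE EXTENSION IF NOT EXISTS amcheck;
-- Checks parent/child relationships in the B-Tree; raises an error if it finds corruption.
SELECT bt_index_parent_check('table_xx_col_idx'::regclass);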
> What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11?
On 11, and no it was just a normal btree.
> It's possible that amcheck would have given you an accurate diagnosis
> of the problem -- especially if you used bt_index_parent_check():

I'll look into this, seems helpful. Thanks!
On Wed, Jul 17, 2019 at 12:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz <aaronepelz@gmail.com> wrote:
> To me it looks like a *single* corrupt index held up autovacuums across our entire server, even in other databases on the same server. Am I interpreting this correctly?
Yes -- that is correct.
What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11?
> Would love guidance on diagnosing this type of thing and strategies for preventing it.
It's possible that amcheck would have given you an accurate diagnosis
of the problem -- especially if you used bt_index_parent_check():
https://www.postgresql.org/docs/current/amcheck.html
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz <aaronepelz@gmail.com> wrote:
>> To me it looks like a *single* corrupt index held up autovacuums across our entire server, even in other databases on the same server. Am I interpreting this correctly?

> Yes -- that is correct.

It looks that way, but how would a broken non-shared index have held up autovacuuming in other databases? Maybe, as this one's xmin horizon got further and further behind, the launcher eventually stopped considering launching workers into any other databases? That seems like a bad thing; it's postponing work that will need to be done eventually.

			regards, tom lane
On Wed, Jul 17, 2019 at 9:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It looks that way, but how would a broken non-shared index have held up
> autovacuuming in other databases? Maybe, as this one's xmin horizon
> got further and further behind, the launcher eventually stopped
> considering launching workers into any other databases? That seems
> like a bad thing; it's postponing work that will need to be done
> eventually.

I don't know exactly how the launcher would behave offhand, but it's clear that not being able to VACUUM one table in one database (because it has a corrupt index) ultimately risks the availability of every database in the cluster. Many installations receive little to no supervision, so it may just be a matter of time there. That is certainly a bad thing.

--
Peter Geoghegan
On Wed, Jul 17, 2019 at 9:21 AM Peter Geoghegan <pg@bowt.ie> wrote:
> It's possible that amcheck would have given you an accurate diagnosis
> of the problem -- especially if you used bt_index_parent_check():
>
> https://www.postgresql.org/docs/current/amcheck.html

BTW, be sure to use the 'heapallindexed' option with bt_index_parent_check() to detect missing downlinks, which is exactly the problem that VACUUM complained about. Hopefully the problem will be limited to the single index that you've already REINDEXed. In theory the same problem could be hiding in other indexes, though I don't consider that particularly likely.

Note that bt_index_parent_check() requires a lock on tables that effectively blocks writes, but not reads, so verification may require planning or coordination. bt_index_check() doesn't have any of these problems, but also won't detect missing downlinks specifically.

--
Peter Geoghegan
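A sketch of running that check against every B-Tree index in the current database, roughly along the lines of the example query in the amcheck docs (run it in a quiet period, since each table is locked against writes while its index is checked):

SELECT c.relname AS index_name,
       bt_index_parent_check(index => c.oid, heapallindexed => true)
FROM pg_index i
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_am am ON c.relam = am.oid
WHERE am.amname = 'btree'
  AND c.relkind = 'i'              -- skip partitioned indexes, which have no storage
  AND c.relpersistence != 't'      -- skip temp indexes belonging to other sessions
  AND i.indisvalid AND i.indisready
ORDER BY c.relpages DESC;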
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Jul 17, 2019 at 9:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It looks that way, but how would a broken non-shared index have held up
>> autovacuuming in other databases? Maybe, as this one's xmin horizon
>> got further and further behind, the launcher eventually stopped
>> considering launching workers into any other databases? That seems
>> like a bad thing; it's postponing work that will need to be done
>> eventually.

> I don't know exactly how the launcher would behave offhand, but it's
> clear that not being able to VACUUM one table in one database (because
> it has a corrupt index) ultimately risks the availability of every
> database in the cluster. Many installations receive little to no
> supervision, so it may just be a matter of time there. That is
> certainly a bad thing.

Right, you're eventually going to get to a forced shutdown if vacuum never succeeds on one table; no question that that's bad. My concern here is that if we have blinders on to the extent of only processing that one table or DB, we're unnecessarily allowing bloat to occur in other tables, and causing that missed vacuuming work to pile up so that there's more of it to be done once the breakage is cleared. If the DBA doesn't notice the problem until getting into a forced shutdown, that is going to extend his outage time --- and, in a really bad worst case, maybe make the difference between being able to recover at all and not.

			regards, tom lane
On Wed, Jul 17, 2019 at 10:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right, you're eventually going to get to a forced shutdown if vacuum never
> succeeds on one table; no question that that's bad.

It occurs to me that we use operator class/insertion scankey comparisons within page deletion, to relocate a leaf page that looks like a candidate for deletion. Despite this, README.hot claims:

"Standard vacuuming scans the indexes to ensure all such index entries are removed, amortizing the index scan cost across as many dead tuples as possible; this approach does not scale down well to the case of reclaiming just a few tuples. In principle one could recompute the index keys and do standard index searches to find the index entries, but this is risky in the presence of possibly-buggy user-defined functions in functional indexes. An allegedly immutable function that in fact is not immutable might prevent us from re-finding an index entry"

That probably wasn't the problem in Aaron's case, but it is worth considering as a possibility.

> My concern here is
> that if we have blinders on to the extent of only processing that one
> table or DB, we're unnecessarily allowing bloat to occur in other tables,
> and causing that missed vacuuming work to pile up so that there's more of
> it to be done once the breakage is cleared. If the DBA doesn't notice the
> problem until getting into a forced shutdown, that is going to extend his
> outage time --- and, in a really bad worst case, maybe make the difference
> between being able to recover at all and not.

The comment about "...any db at risk of Xid wraparound..." within do_start_worker() hints at such a problem.

Maybe nbtree VACUUM should do something more aggressive than give up when there is a "failed to re-find parent key" or similar condition. Perhaps it would make more sense to make the index inactive (for some value of "inactive") instead of just complaining. That might be the least worst option, all things considered.

--
Peter Geoghegan
On 2019-Jul-17, Peter Geoghegan wrote:
> Maybe nbtree VACUUM should do something more aggressive than give up
> when there is a "failed to re-find parent key" or similar condition.
> Perhaps it would make more sense to make the index inactive (for some
> value of "inactive") instead of just complaining. That might be the
> least worst option, all things considered.

Maybe we can mark an index as unvacuumable in some way? As far as I understand, all queries using that index work, as do index updates; it's just vacuuming that fails. If we mark the index as unvacuumable, then vacuum just skips it (and does not run phase 3 for that table), and things can proceed; the table's age can still be advanced. Obviously it'll result in more bloat than in normal condition, but it shouldn't cause the whole cluster to go down.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Jul-17, Peter Geoghegan wrote:
>> Maybe nbtree VACUUM should do something more aggressive than give up
>> when there is a "failed to re-find parent key" or similar condition.
>> Perhaps it would make more sense to make the index inactive (for some
>> value of "inactive") instead of just complaining. That might be the
>> least worst option, all things considered.

> Maybe we can mark an index as unvacuumable in some way? As far as I
> understand, all queries using that index work, as do index updates; it's
> just vacuuming that fails. If we mark the index as unvacuumable, then
> vacuum just skips it (and does not run phase 3 for that table), and
> things can proceed; the table's age can still be advanced. Obviously
> it'll result in more bloat than in normal condition, but it shouldn't
> cause the whole cluster to go down.

If an index is corrupt enough to break vacuum, I think it takes a rather large leap of faith to believe that it's not going to cause problems for inserts or searches. I'd go with just marking the index broken and insisting that it be REINDEX'd before we touch it again.

Not sure how we make that happen automatically though, or if we even should. (a) once the transaction's failed, you can't go making catalog updates; (b) even when you know the transaction's failed, blaming it on a particular index seems a bit chancy; (c) automatically disabling constraint indexes seems less than desirable.

			regards, tom lane
On 2019-Jul-17, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Jul-17, Peter Geoghegan wrote:
> >> Maybe nbtree VACUUM should do something more aggressive than give up
> >> when there is a "failed to re-find parent key" or similar condition.
> >> Perhaps it would make more sense to make the index inactive (for some
> >> value of "inactive") instead of just complaining. That might be the
> >> least worst option, all things considered.
>
> > Maybe we can mark an index as unvacuumable in some way? As far as I
> > understand, all queries using that index work, as do index updates; it's
> > just vacuuming that fails. If we mark the index as unvacuumable, then
> > vacuum just skips it (and does not run phase 3 for that table), and
> > things can proceed; the table's age can still be advanced. Obviously
> > it'll result in more bloat than in normal condition, but it shouldn't
> > cause the whole cluster to go down.
>
> If an index is corrupt enough to break vacuum, I think it takes a rather
> large leap of faith to believe that it's not going to cause problems for
> inserts or searches.

Maybe, but it's what happened in the reported case. (Note Aaron was careful to do the index replacement concurrently -- he wouldn't have done that if the table wasn't in active use.)

> I'd go with just marking the index broken and
> insisting that it be REINDEX'd before we touch it again.

This might make things worse operationally, though. If searches aren't failing but vacuum is, we'd break a production system that currently works.

> (a) once the transaction's failed, you can't go making catalog updates;

Maybe we can defer the actual update to some other transaction -- say register an autovacuum work-item, which can be executed separately.

> (b) even when you know the transaction's failed, blaming it on a
> particular index seems a bit chancy;

Well, vacuum knows what index is being processed. Maybe you're thinking that autovac can get an out-of-memory condition or something like that; perhaps we can limit the above only when an ERRCODE_DATA_CORRUPTED condition is reported (and make sure all such conditions do that. As far as I remember we have a patch for this particular error to be reported as such.)

> (c) automatically disabling constraint indexes seems less than desirable.

Disabling them for writes, yeah.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 17, 2019 at 11:43 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> This might make things worse operationally, though. If searches aren't
> failing but vacuum is, we'd break a production system that currently
> works.

If searches aren't failing and VACUUM works, then that's probably down to dumb luck. The user's luck could change at any time (actually, it's quite possible that the index is already giving wrong answers without anybody realizing). That's not always true, of course -- you could have an OOM condition in VACUUM, where it really does make sense to retry. But it should be true for the category of errors where we behave more aggressively than just giving up, such as the "failed to re-find parent key" error Aaron noticed.

> Well, vacuum knows what index is being processed. Maybe you're thinking
> that autovac can get an out-of-memory condition or something like that;
> perhaps we can limit the above only when an ERRCODE_DATA_CORRUPTED
> condition is reported (and make sure all such conditions do that. As
> far as I remember we have a patch for this particular error to be
> reported as such.)

I don't think that it would be that hard to identify errors that nbtree VACUUM could throw that clearly indicate corruption, without any hope of the problem self-correcting without the DBA running a REINDEX. There will be a small amount of gray area, perhaps, but probably not enough to matter.

> > (c) automatically disabling constraint indexes seems less than desirable.
>
> Disabling them for writes, yeah.

I think that it's fair to say that all bets are off once you see the "failed to re-find parent key" error, or any other such error that indicates corruption. Admittedly it isn't 100% clear that disabling constraint enforcement to unblock autovacuum for the whole cluster is better than any available alternative; it's really hard to reason about things when we already know that the database has corruption.

I think that it's okay that almost anything can break when somebody creates an index on a non-immutable expression (including VACUUM), provided that all the problems only affect the table with the broken index. OTOH, taking down the entire Postgres cluster as an indirect consequence of one person's ill-considered CREATE INDEX really sucks. That distinction seems important to me.

--
Peter Geoghegan
On Wed, Jul 17, 2019 at 10:27 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > It's possible that amcheck would have given you an accurate diagnosis
> > of the problem -- especially if you used bt_index_parent_check():
> >
> > https://www.postgresql.org/docs/current/amcheck.html
>
> BTW, be sure to use the 'heapallindexed' option with
> bt_index_parent_check() to detect missing downlinks, which is exactly
> the problem that VACUUM complained about.

Can you tell us more about this index? Can you share its definition (i.e. what does \d show in psql)?

Is it an expression index, or a partial index? A composite? What datatypes are indexed?

Thanks
--
Peter Geoghegan
Hi,

On 2019-07-17 13:27:23 -0400, Tom Lane wrote:
> My concern here is that if we have blinders on to the extent of only
> processing that one table or DB, we're unnecessarily allowing bloat to
> occur in other tables, and causing that missed vacuuming work to pile
> up so that there's more of it to be done once the breakage is cleared.

That's a pretty common problem in the real world, with or without problems like corrupted indexes. Autovacuum's scheduling is just too simplistic to avoid it.

Common problems:

- If one table in one database has an xmin older than autovacuum_freeze_max_age, the autovacuum launcher continuously throws more workers at that database, even though there's a lot of work to be done in other databases. That's because do_start_worker() ignores the "adl_next_worker" mechanism, and *always* starts more workers for the database with the oldest datfrozenxid (same with datminmxid), and because we explicitly don't start workers for other databases ("/* ignore not-at-risk DBs */"). That's especially bad if that database has a long-running transaction preventing cleanup from happening - the vacuuming will not be able to advance the horizon, and all others are starved.

- If there's a table "early in pg_class", and vacuum on it fails, it will prevent vacuuming of other tables. It's pretty common for vacuum to fail often for a table, e.g. because it's sometimes exclusively locked, which then causes autovacuum to kill itself. There's absolutely no mechanism for autovacuum workers to skip over that table for a while, leading to all other tables in the database not being vacuumed, unless there happens to be a second worker in the database while the first vacuum hasn't yet failed. This obviously also applies to the corrupted index case. The 'kill itself' behaviour is exacerbated by lazy_truncate_heap()'s exclusive lock - which will obviously trigger other backends to send cancellation requests. There's unfortunately a number of places where that leads us to just throw away all the work done, and not update pg_class.relfrozenxid/datfrozenxid.

- Anti-wraparound vacuums are more impactful (they don't cancel themselves upon lock conflicts, cause more IO, wait for cleanup locks), and often emit scary messages ("oldest xmin is far in the past"). But we don't have *any* mechanism that avoids them for very common scenarios. E.g. for insert-mostly workloads, there'll never be enough dead tuples to cause a vacuum to happen before autovacuum_freeze_max_age is reached. That prevents vacuum_freeze_table_age from pre-empting the need to do an anti-wraparound vacuum by increasing the xid horizon. We literally call anti-wraparound autovacuums "emergency" in the code, yet they're an almost inevitable part of running postgres.

- There's no meaningful in-database prioritization, which means that autovacuum workers might happily vacuum a table just a bit over the thresholds, even though there are much worse tables around. Especially on busy and large databases, that can lead to workers started for anti-wraparound reasons effectively never getting to vacuum the tables above autovacuum_freeze_max_age, because tables earlier in pg_class are modified heavily enough that they have dead tuples above the thresholds by the time vacuum finishes. To get to the table needing the anti-wraparound vacuum, a single worker needs to go through all tables preceding it in pg_class that need vacuuming (only skipping over ones that are concurrently vacuumed by somebody else, but not ones that have *recently* been vacuumed).

I kinda forgot how bad this one was until looking at the code again. And there are plenty more.

My impression is that these are really hard to fix unless we develop a new scheduling approach, and that scheduling approach probably needs to be more stateful than the current code.

IMO these are the main concerns for how work needs to be distributed:

- Work in different databases needs to be scheduled in a smarter way; in particular, anti-wraparound cannot simply cause only the database with the oldest datfrozenxid to be vacuumed until the wraparound "danger" is over (especially not if there are other databases needing to be anti-wraparound vacuumed).

- Tables within a database need to be prioritized in a smarter way, so tables with a lot of bloat get vacuumed before ones with a lot less bloat, and similarly tables with the oldest horizon need to be vacuumed before ones with newer horizons, even if all of the tables are above the autovacuum thresholds.

- Tables need to be prioritized across databases, to avoid problems like one currently-vacuumed table causing unnecessary anti-wraparound workers to be launched for a database where they can't perform any work.

- There needs to be a separate rung between normal autovacuums and anti-wraparound vacuums. The in-between level does 'aggressive' vacuuming (so it freezes), even for tables with just inserts, but stays interruptible, and doesn't have a scary 'anti-wraparound' name.

- Repeated failures on one table must not starve all work for other tables in a database.

- Errors vacuuming one table must not prevent vacuuming of lower-priority tables until that error is fixed.

To me that means that we need prioritization across databases, and between tables, and probably by multiple criteria. I suspect there need to be multiple criteria for how urgent vacuuming is, and autovacuum ought to try to make progress on all of them.

I suspect the way to implement that would be to have the avlauncher start workers for all databases independent of tasks needing to be done (but at a slower schedule than autovacuum_naptime). Those workers would then put the tables needing vacuuming in their database into a global prioritized list, displacing entries that are less important once the list is full. Proper autovacuuming work would then be done strictly in priority order.

To avoid one priority criterion (e.g. age(relfrozenxid)) preventing any work being done on other criteria (e.g. number of dead tuples), I suspect we ought to round-robin between different criteria. E.g. have one list of tables needing vacuum ordered by age(relfrozenxid), one by %dead-tuples, and one by last-failure time, and have workers try to make progress on all of them. If that sounds similar to what is being argued about around UNDO processing in workers, that's not an accident.

A lesser, but still pretty pressing, concern is that we currently have no system that maintains the visibility map readily for insert-mostly tables. That means that a) index-only scans aren't going to happen for such tables unless they're manually vacuumed, and b) the anti-wraparound vacuum that eventually does happen is going to be extremely expensive/slow.

Regards,

Andres
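A rough way to eyeball how much of a large, insert-mostly table is currently marked all-visible (a sketch; the page-count cutoff is arbitrary, and relallvisible is just the estimate kept in pg_class):

SELECT relname, relpages, relallvisible,
       round(100.0 * relallvisible / greatest(relpages, 1), 1) AS pct_all_visible
FROM pg_class
WHERE relkind = 'r' AND relpages > 1000
ORDER BY pct_all_visible ASC
LIMIT 20;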
On 17.07.2019 18:14, Andres Freund wrote:
> To me that means that we need prioritization across databases, and
> between tables, and probably by multiple criteria. I suspect there need
> to be multiple criteria for how urgent vacuuming is, and autovacuum
> ought to try to make progress on all of them.

I'm not a contributor, and don't know the code base, so don't normally comment. But it occurs to me in the present case that there may be a short-term workaround for the current problem: could autovacuum visit tables in a random order? Then at least all tables would get vacuumed eventually (in the current case). If, later, more complex priorities are implemented, they could be probabilistic weights.

Best,

--
Shaun Cutts
> Can you tell us more about this index? Can you share its definition
> (i.e. what does \d show in psql)?
> Is it an expression index, or a partial index? A composite? What
> datatypes are indexed?
It's a simple btree index on a geometry(Point,4326) column: not an expression index, not partial, not composite.
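Roughly what the definition looks like (table and column names changed):

-- geom is declared as geometry(Point,4326); a plain btree index on it
-- picks up PostGIS's default btree operator class for geometry.
CREATE INDEX table_xx_col_idx ON table_xx USING btree (geom);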
On Wed, Jul 17, 2019 at 3:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 17, 2019 at 10:27 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > It's possible that amcheck would have given you an accurate diagnosis
> > of the problem -- especially if you used bt_index_parent_check():
> >
> > https://www.postgresql.org/docs/current/amcheck.html
>
> BTW, be sure to use the 'heapallindexed' option with
> bt_index_parent_check() to detect missing downlinks, which is exactly
> the problem that VACUUM complained about.
Can you tell us more about this index? Can you share its definition
(i.e. what does \d show in psql)?
Is it an expression index, or a partial index? A composite? What
datatypes are indexed?
Thanks
--
Peter Geoghegan
On Thu, Jul 18, 2019 at 9:06 AM Aaron Pelz <aaronepelz@gmail.com> wrote:
> It's a simple btree index on a geometry(Point,4326) column: not an expression index, not partial, not composite.

The cause of the corruption may be a bug in a PostGIS B-Tree operator class. I reported a bug in the Geography type that could lead to corrupt B-Tree indexes (not Geometry):

https://trac.osgeo.org/postgis/ticket/3841

Though I also see what could be a comparable bug in Geometry:

https://trac.osgeo.org/postgis/ticket/3777

These bugs are from about 3 years ago. If I'm right you should be able to isolate the bug using amcheck.

--
Peter Geoghegan