Re: Corrupt index stopping autovacuum system wide - Mailing list pgsql-general

From Andres Freund
Subject Re: Corrupt index stopping autovacuum system wide
Date
Msg-id 20190717221421.muxfrw2o2iai4xk6@alap3.anarazel.de
Whole thread Raw
In response to Re: Corrupt index stopping autovacuum system wide  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Corrupt index stopping autovacuum system wide
List pgsql-general
Hi,

On 2019-07-17 13:27:23 -0400, Tom Lane wrote:
> My concern here is that if we have blinders on to the extent of only
> processing that one table or DB, we're unnecessarily allowing bloat to
> occur in other tables, and causing that missed vacuuming work to pile
> up so that there's more of it to be done once the breakage is cleared.

That a pretty common problem in the real world, with or without problems
like corrupted indexes. Autovacuum's scheduling is just too simplistic
to avoid that.

Common problems:

- If one table in one database has an xmin older than
  autovacuum_freeze_max_age, the autovacuum launcher continuously throws
  more workers at that database. Even though there's a lot of work to be
  done in other databases.

  That's because do_start_worker() ignores the "adl_next_worker"
  mechanism, and *always* starts more workers for the database with the
  oldest datfrozenxid (same with datminmxid), and because we explicitly
  don't start workers for other databases ("/* ignore not-at-risk DBs
  */").

  That's especially bad if that database has a longrunning transaction
  preventing cleanup from happening - the vacuuming will not be able to
  advance the horizon, and all others are starved.


- If there's a table "early in pg_class", and vacuum on it failes, it
  will prevent vacuuming other tables. It's pretty common for vacuum to
  fail often for a table, e.g. because it's sometimes exclusively locked,
  which then causes autovacuum to kill itself.  There's absolutely no
  mechanism for autovacuum workers to skip over that table for a while,
  leading to all other tables in the database not being vacuumed, unless
  there happens to be second worker in the database, while the first
  vacuum hasn't failed.

  This obviously also applies to the corrupted index case.

  The 'kill itself' behaviour is exascerbated by lazy_truncate_heap()'s
  exclusive lock - which will obviously trigger other backend to send
  cancellation requests. There's unfortunately a number of places where
  that leads us to just throw all the work done away, and not update
  pg_class.relfrozenxid/datfrozenxid


- Anti-wraparound vacuums are more impactful (don't cancel themselves
  upon lock conflicts, cause more IO, wait for cleanup locks), often
  emit scary messages ("oldest xmin is far in the past").  But we don't
  have *any* mechanism that avoids them for very common scenarios.

  E.g. for insert-mostly workloads, there'll never be enough dead tuples
  to cause a vacuum to happen before autovacuum_freeze_max_age is
  reached. That prevents vacuum_freeze_table_age from pre-empting the
  need to do an anti-wraparound vacuum, by increasing the xid horizon.

  We literally call anti-wraparound autovacuums "emergency" in the code,
  yet they're an almost inevitablepart of running postgres.


- There's no meaningful in-database prioritization. Which means that
  autovacuum workers might happily vacuum the table just a bit over the
  thresholds, even though there's much worse tables around.

  Especially on a busy and large databases that can lead to
  anti-wraparound started launchers effectively never getting to
  vacuuming tables above autovacuum_freeze_max_age, because tables
  earlier in pg_class are modified heavily enough that they have dead
  tuples above the thresholds by the time vacuum finishes.  To get to
  the anti-wraparound vacuum needing table, a single launcher needs to
  go through all tables preceding the table in pg_class that need
  vacuuming (only skipping over ones that are concurrently vacuumed by
  somebody else, but not ones that have *recently* been vacuumed).

  I kinda forgot how bad this one was until looking at the code again.

And there are plenty more.


My impression is that these are really hard to fix unless we develop a
new scheduling approach. And that scheduling approach probably needs to
be more stateful than the current code.

IMO these are the main concerns for how work needs to be distributed:

- work in different databases needs to be scheduled in a smarter way, in
  particular anti-wraparound cannot simply cause only the database with
  the to oldest datfrozenxid to be vacuumed until the wraparound
  "danger" is over (especially not if there's other database needing to
  be anti-wrap vacuumed)

- tables within a database need to be prioritized in a smarter way, so
  databases with a lot of bloat get vacuumed before ones with a lot less
  bloat, and similarly tables with the oldest horizon need to be
  vacuumed before ones with newer horizons, even if all of the tables
  are above the autovacuum thresholds.

- tables need to be prioritized across databases, to avoid problems like
  one currently vacuumed table causing unnecessary anti-wraparound
  workers to be launched for a database where they can't perform any
  work.

- there needs to be a separate rung between normal autovacuums and
  anti-wraparound vacuums. The inbetween level does 'aggressive'
  vacuuming (so it freezes), even for tables with just inserts, but
  keeps interruptible, and doesn't have a scary 'anti-wraparound' name.

- Repeated failures on one table may not starve all work for other
  tables in a database.

- errors vacuuming one table may not cause vacuuming on tables with a
  lower priority from happening until that error is fixed.



To me that means that we need prioritization across databases, and
between tables, and probably by multiple criteria. I suspect there need
to be multiple criteria how urgent vacuuming is, and autovacuum ought to
try to make progress on all of them.


I suspect the way to implement that would be to have avlauncher start
workers for all databases independent of tasks needing to be done (but
at a slower schedule than autovacuum_naptime). Those workers then would
put the tables needing vacuuming in the database into the global
prioritized list, displacing entries that are less important once the
list is full.  Proper autovacuuming work would then be done strictly in
priority order.

To avoid one priority criteria (e.g. age(relfrozenxid) from preventing
any work done on other criteria (e.g. number of dead tuples), I suspect
we ought to round robin between different criterias.  E.g. have one list
of tables needing vacuum ordered by age(relfrozenxid), one by
%dead-tuples, and one by last-failure time, and have workers try to make
progress on all.

If that sounds familiar to what is being argued about around UNDO
processing in workers, that'd not be an accident.




A lesser, but still pretty pressing concern, is that we currently have
no system that maintains the visibilitymap readily for insert-mostly
tables. That means that a) index-only scans aren't going to happen for
such tables, unless they're manually vacuumed b) the anti-wraparound
vacuum that eventually is going to happen, is going to be extremely
expensive/slow.


Regards,

Andres



pgsql-general by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Corrupt index stopping autovacuum system wide
Next
From: shauncutts@factfiber.com
Date:
Subject: Re: Corrupt index stopping autovacuum system wide