Plans for solving the VACUUM problem - Mailing list pgsql-hackers

From Tom Lane
Subject Plans for solving the VACUUM problem
Date
Msg-id 12833.990140724@sss.pgh.pa.us
Responses Re: Plans for solving the VACUUM problem  (Bruce Momjian <pgman@candle.pha.pa.us>)
Re: Plans for solving the VACUUM problem  (Tim Allen <tim@proximity.com.au>)
Re: Plans for solving the VACUUM problem  (Philip Warner <pjw@rhyme.com.au>)
Re: Plans for solving the VACUUM problem  (Oleg Bartunov <oleg@sai.msu.su>)
Re: Plans for solving the VACUUM problem  (Kaare Rasmussen <kar@webline.dk>)
List pgsql-hackers
I have been thinking about the problem of VACUUM and how we might fix it
for 7.2.  Vadim has suggested that we should attack this by implementing
an overwriting storage manager and transaction UNDO, but I'm not totally
comfortable with that approach: it seems to me that it's an awfully large
change in the way Postgres works.  Instead, here is a sketch of an attack
that I think fits better into the existing system structure.

First point: I don't think we need to get rid of VACUUM, exactly.  What
we want for 24x7 operation is to be able to do whatever housekeeping we
need without locking out normal transaction processing for long intervals.
We could live with routine VACUUMs if they could run in parallel with
reads and writes of the table being vacuumed.  They don't even have to run
in parallel with schema updates of the target table (CREATE/DROP INDEX,
ALTER TABLE, etc).  Schema updates aren't things you do lightly for big
tables anyhow.  So what we want is more of a "background VACUUM" than a
"no VACUUM" solution.

Second: if VACUUM can run in the background, then there's no reason not
to run it fairly frequently.  In fact, it could become an automatically
scheduled activity like CHECKPOINT is now, or perhaps even a continuously
running daemon (which was the original conception of it at Berkeley, BTW).
This is important because it means that VACUUM doesn't have to be perfect.
The existing VACUUM code goes to huge lengths to ensure that it compacts
the table as much as possible.  We don't need that; if we miss some free
space this time around, we can expect to get it the next time (or
eventually), so we can be happy.  This leads to thinking of space management
in terms of steady-state behavior, rather than the periodic "big bang"
approach that VACUUM represents now.

But having said that, there's no reason to remove the existing VACUUM
code: we can keep it around for situations where you need to crunch a
table as much as possible and you can afford to lock the table while
you do it.  The new code would be a new command, maybe "VACUUM LAZY"
(or some other name entirely).

Enough handwaving, what about specifics?

1. Forget moving tuples from one page to another.  Doing that in a
transaction-safe way is hugely expensive and complicated.  Lazy VACUUM
will only delete dead tuples and coalesce the free space thus made
available within each page of a relation.

2. This does no good unless there's a provision to re-use that free space.
To do that, I propose a free space map (FSM) kept in shared memory, which
will tell backends which pages of a relation have free space.  Only if the
FSM shows no free space available will the relation be extended to insert
a new or updated tuple.

3. Lazy VACUUM processes a table in five stages:
   A. Scan relation looking for dead tuples; accumulate a list of their
      TIDs, as well as info about existing free space.  (This pass is
      completely read-only and so incurs no WAL traffic.)
   B. Remove index entries for the dead tuples.  (See below for details.)
   C. Physically delete dead tuples and compact free space on their pages.
   D. Truncate any completely-empty pages at relation's end.  (Optional,
      see below.)
   E. Create/update FSM entry for the table.
Note that this is crash-safe as long as the individual update operations
are atomic (which can be guaranteed by WAL entries for them).  If a tuple
is dead, we care not whether its index entries are still around or not;
so there's no risk to logical consistency.

4. Observe that lazy VACUUM need not really be a transaction at all, since
there's nothing it does that needs to be cancelled or undone if it is
aborted.  This means that its WAL entries do not have to hang around past
the next checkpoint, which solves the huge-WAL-space-usage problem that
people have noticed while VACUUMing large tables under 7.1.

5. Also note that there's nothing saying that lazy VACUUM must do the
entire table in one go; once it's accumulated a big enough batch of dead
tuples, it can proceed through steps B, C, D, E even though it hasn't scanned
the whole table.  This avoids a rather nasty problem that VACUUM has
always had with running out of memory on huge tables.
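
To make the control flow concrete, here is a rough C sketch of the staged,
batched processing described in points 3 and 5.  Every helper name and
constant below is invented purely for illustration; none of this is meant
as the actual function layout.

    #include <stddef.h>

    #define MAX_DEAD_TIDS        10000  /* bound on per-batch memory use */
    #define MAX_TUPLES_PER_PAGE  300    /* worst-case dead tuples on one page */

    typedef struct { unsigned blkno; unsigned short offnum; } TupleId;

    typedef struct
    {
        TupleId dead[MAX_DEAD_TIDS];    /* TIDs of dead tuples, in TID order */
        size_t  ndead;
    } DeadTidList;

    /* Hypothetical stand-ins for the real heap/index/FSM routines. */
    extern void     scan_page_for_dead_tuples(unsigned blkno, DeadTidList *dead); /* A */
    extern void     bulk_delete_index_entries(const DeadTidList *dead);           /* B */
    extern void     delete_dead_tuples_and_compact(const DeadTidList *dead);      /* C */
    extern void     try_truncate_empty_end_pages(void);                           /* D */
    extern void     update_fsm_entry_for_relation(void);                          /* E */
    extern unsigned relation_page_count(void);

    void
    lazy_vacuum_relation(void)
    {
        DeadTidList dead;
        unsigned    nblocks = relation_page_count();

        dead.ndead = 0;

        for (unsigned blkno = 0; blkno < nblocks; blkno++)
        {
            /*
             * Point 5: if the next page could overflow the batch, run steps
             * B and C on what we have so far and start a fresh batch, so
             * memory use stays bounded no matter how large the table is.
             */
            if (dead.ndead > MAX_DEAD_TIDS - MAX_TUPLES_PER_PAGE)
            {
                bulk_delete_index_entries(&dead);
                delete_dead_tuples_and_compact(&dead);
                dead.ndead = 0;
            }

            scan_page_for_dead_tuples(blkno, &dead);    /* read-only, no WAL */
        }

        if (dead.ndead > 0)
        {
            bulk_delete_index_entries(&dead);
            delete_dead_tuples_and_compact(&dead);
        }

        try_truncate_empty_end_pages();     /* skipped if we can't get the lock */
        update_fsm_entry_for_relation();
    }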


Free space map details
----------------------

I envision the FSM as a shared hash table keyed by table ID, with each
entry containing a list of page numbers and free space in each such page.

The FSM is empty at system startup and is filled by lazy VACUUM as it
processes each table.  Backends then decrement/remove page entries as they
use free space.

Critical point: the FSM is only a hint and does not have to be perfectly
accurate.  It can omit space that's actually available without harm, and
if it claims there's more space available on a page than there actually
is, we haven't lost much except a wasted ReadBuffer cycle.  This allows
us to take shortcuts in maintaining it.  In particular, we can constrain
the FSM to a prespecified size, which is critical for keeping it in shared
memory.  We just discard entries (pages or whole relations) as necessary
to keep it under budget.  Obviously, we'd not bother to make entries in
the first place for pages with only a little free space.  Relation entries
might be discarded on a least-recently-used basis.
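
For concreteness, here is one possible shape for an FSM entry, written as a
C struct.  All of the names and the per-relation page budget are invented
for illustration only; the real layout would be whatever suits the shared
hash table code.

    typedef unsigned int BlockNum;
    typedef unsigned int RelationId;

    #define FSM_MAX_PAGES_PER_REL 32    /* arbitrary per-relation budget */

    typedef struct
    {
        BlockNum blkno;                 /* page believed to have free space */
        unsigned avail;                 /* approximate free bytes on that page */
    } FsmPageSlot;

    /* One entry in the shared hash table, keyed by relation ID. */
    typedef struct
    {
        RelationId  relid;              /* hash key */
        BlockNum    rel_nblocks;        /* cached relation size (see below) */
        unsigned    last_used;          /* recency counter, so whole relation
                                         * entries can be discarded LRU-style */
        int         npages;             /* page slots currently in use */
        FsmPageSlot pages[FSM_MAX_PAGES_PER_REL];
    } FsmRelationEntry;

With a fixed number of page slots per relation and a fixed number of
relation entries, the whole map has a hard upper bound on its shared-memory
footprint, which is the property we need.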

Accesses to the FSM could create contention problems if we're not careful.
I think this can be dealt with by having each backend remember (in its
relcache entry for a table) the page number of the last page it chose from
the FSM to insert into.  That backend will keep inserting new tuples into
that same page, without touching the FSM, as long as there's room there.
Only then does it go back to the FSM, update or remove that page entry,
and choose another page to start inserting on.  This reduces the access
load on the FSM from once per tuple to once per page.  (Moreover, we can
arrange that successive backends consulting the FSM pick different pages
if possible.  Then, concurrent inserts will tend to go to different pages,
reducing contention for shared buffers; yet any single backend does
sequential inserts in one page, so that a bulk load doesn't cause
disk traffic scattered all over the table.)

The FSM can also cache the overall relation size, saving an lseek kernel
call whenever we do have to extend the relation for lack of internal free
space.  This will help pay for the locking cost of accessing the FSM.
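
Putting the last few paragraphs together, the backend-side logic might look
something like the sketch below.  The relcache field and every helper call
are placeholders, not real routines; the point is just that the FSM is
touched once per page, not once per tuple.

    typedef unsigned int BlockNum;

    #define INVALID_BLOCK ((BlockNum) 0xFFFFFFFF)

    /* Imagine this living in the backend's relcache entry for the table. */
    typedef struct
    {
        BlockNum targetBlock;       /* last insertion page chosen from the FSM */
    } RelCacheInsertState;

    /* Hypothetical stand-ins for buffer-manager, FSM, and smgr calls.
     * fsm_get_page_with_space() could hand successive callers different
     * candidate pages, so concurrent inserters spread across the table. */
    extern int      page_has_room(BlockNum blk, unsigned needed);
    extern BlockNum fsm_get_page_with_space(unsigned needed);   /* or INVALID_BLOCK */
    extern void     fsm_forget_page(BlockNum blk);
    extern BlockNum extend_relation_by_one_page(void);

    BlockNum
    choose_insertion_page(RelCacheInsertState *rc, unsigned tuple_size)
    {
        /* Fast path: keep filling the page we already picked; no FSM access. */
        if (rc->targetBlock != INVALID_BLOCK &&
            page_has_room(rc->targetBlock, tuple_size))
            return rc->targetBlock;

        /* Old target is full (or we never had one): update the FSM's entry
         * for it and ask for another page. */
        if (rc->targetBlock != INVALID_BLOCK)
            fsm_forget_page(rc->targetBlock);

        rc->targetBlock = fsm_get_page_with_space(tuple_size);

        /* FSM knows of no usable space: extend the relation instead.  The
         * FSM's cached relation size is what spares us an lseek here. */
        if (rc->targetBlock == INVALID_BLOCK)
            rc->targetBlock = extend_relation_by_one_page();

        return rc->targetBlock;
    }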


Locking issues
--------------

We will need two extensions to the lock manager:

1. A new lock type that allows concurrent reads and writes
(AccessShareLock, RowShareLock, RowExclusiveLock) but not anything else.
Lazy VACUUM will grab this type of table lock to ensure the table schema
doesn't change under it.  Call it a VacuumLock until we think of a better
name.

2. A "conditional lock" operation that acquires a lock if available, but
doesn't block if not.

The conditional lock will be used by lazy VACUUM to try to upgrade its
VacuumLock to an AccessExclusiveLock at step D (truncate table).  If it's
able to get exclusive lock, it's safe to truncate any unused end pages.
Without exclusive lock, it's not, since there might be concurrent
transactions scanning or inserting into the empty pages.  We do not want
lazy VACUUM to block waiting to do this, since if it does that it will
create a lockout situation (reader/writer transactions will stack up
behind it in the lock queue while everyone waits for the existing
reader/writer transactions to finish).  Better to not do the truncation.
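
In code terms, step D might come out looking roughly like this; the
lock-manager function names are placeholders for whatever the
conditional-lock primitive ends up being called.

    /* Hypothetical lock-manager and storage calls, for illustration only. */
    extern int  conditional_upgrade_to_access_exclusive_lock(void);  /* never blocks */
    extern void downgrade_to_vacuum_lock(void);
    extern void truncate_empty_pages_at_end(void);

    static void
    lazy_vacuum_try_truncate(void)
    {
        /*
         * Try to upgrade our VacuumLock to AccessExclusiveLock without
         * waiting.  If anyone holds a conflicting lock we simply skip the
         * truncation; queueing here would stall new readers and writers
         * behind us, which is exactly what lazy VACUUM must not do.
         */
        if (!conditional_upgrade_to_access_exclusive_lock())
            return;                 /* punt; maybe next VACUUM gets it */

        truncate_empty_pages_at_end();
        downgrade_to_vacuum_lock();
    }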

Another place where lazy VACUUM may be unable to do its job completely
is in compaction of space on individual disk pages.  It can physically
move tuples to perform compaction only if there are not currently any
other backends with pointers into that page (which can be tested by
looking to see if the buffer reference count is one).  Again, we punt
and leave the space to be compacted next time if we can't do it right
away.

The fact that inserted/updated tuples might wind up anywhere in the table,
not only at the end, creates no headaches except for heap_update.  That
routine needs buffer locks on both the page containing the old tuple and
the page that will contain the new.  To avoid possible deadlocks between
different backends locking the same two pages in opposite orders, we need
to constrain the lock ordering used by heap_update.  This is doable but
will require slightly more code than is there now.
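
One standard way to constrain the ordering (not necessarily the exact rule
we'd adopt) is to always lock the lower-numbered page first, as in this
sketch; the buffer-lock calls are invented names.

    typedef unsigned int BlockNum;

    extern void lock_page_exclusive(BlockNum blk);

    /* heap_update would do something like this before touching either page. */
    static void
    lock_old_and_new_pages(BlockNum oldblk, BlockNum newblk)
    {
        if (oldblk == newblk)
        {
            lock_page_exclusive(oldblk);    /* same page: one lock suffices */
            return;
        }

        /*
         * Two backends updating between the same pair of pages now acquire
         * the locks in the same order, so they cannot deadlock on each other.
         */
        if (oldblk < newblk)
        {
            lock_page_exclusive(oldblk);
            lock_page_exclusive(newblk);
        }
        else
        {
            lock_page_exclusive(newblk);
            lock_page_exclusive(oldblk);
        }
    }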


Index access method improvements
--------------------------------

Presently, VACUUM deletes index tuples by doing a standard index scan
and checking each returned index tuple to see if it points at any of
the tuples to be deleted.  If so, the index AM is called back to delete
the tested index tuple.  This is horribly inefficient: it means one trip
into the index AM (with associated buffer lock/unlock and search overhead)
for each tuple in the index, plus another such trip for each tuple actually
deleted.

This is mainly a problem of a poorly chosen API.  The index AMs should
offer a "bulk delete" call, which is passed a sorted array of main-table
TIDs.  The loop over the index tuples should happen internally to the
index AM.  At least in the case of btree, this could be done by a
sequential scan over the index pages, which avoids the random I/O of an
index-order scan and so should offer additional speedup.
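
To show the shape of the API I have in mind, here is a sketch of such a
bulk-delete entry point.  The signature, the TID representation, and the
page-walking helpers are all made up; the point is the single pass over the
index with a binary search into the caller's sorted TID array.

    #include <stdlib.h>     /* bsearch */

    typedef struct { unsigned blkno; unsigned short offnum; } TupleId;

    static int
    tid_compare(const void *a, const void *b)
    {
        const TupleId *ta = a, *tb = b;

        if (ta->blkno != tb->blkno)
            return (ta->blkno < tb->blkno) ? -1 : 1;
        if (ta->offnum != tb->offnum)
            return (ta->offnum < tb->offnum) ? -1 : 1;
        return 0;
    }

    /* One binary search per index entry replaces a full index scan per
     * dead heap tuple; the caller guarantees dead_tids is sorted. */
    static int
    tid_is_dead(const TupleId *heap_tid, const TupleId *dead_tids, size_t ndead)
    {
        return bsearch(heap_tid, dead_tids, ndead,
                       sizeof(TupleId), tid_compare) != NULL;
    }

    /* Hypothetical helpers for walking the index's own pages in order. */
    extern unsigned index_page_count(void);
    extern size_t   index_entries_on_page(unsigned pageno);
    extern TupleId  index_entry_heap_tid(unsigned pageno, size_t slot);
    extern void     index_delete_entry(unsigned pageno, size_t slot);

    void
    index_bulk_delete(const TupleId *dead_tids, size_t ndead)
    {
        unsigned npages = index_page_count();

        for (unsigned pageno = 0; pageno < npages; pageno++)
        {
            /* Walk each page's entries backward so slot numbers of entries
             * we haven't visited yet stay valid after a deletion. */
            for (size_t slot = index_entries_on_page(pageno); slot-- > 0; )
            {
                TupleId heap_tid = index_entry_heap_tid(pageno, slot);

                if (tid_is_dead(&heap_tid, dead_tids, ndead))
                    index_delete_entry(pageno, slot);
            }
        }
    }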

Further out (possibly not for 7.2), we should also look at making the
index AMs responsible for shrinking indexes during deletion, or perhaps
via a separate "vacuum index" API.  This can be done without exclusive
locks on the index --- the original Lehman & Yao concurrent-btrees paper
didn't describe how, but more recent papers show how to do it.  As with
the main tables, I think it's sufficient to recycle freed space within
the index, and not necessarily try to give it back to the OS.

We will also want to look at upgrading the non-btree index types to allow
concurrent operations.  This may be a research problem; I don't expect to
touch that issue for 7.2.  (Hence, lazy VACUUM on tables with non-btree
indexes will still create lockouts until this is addressed.  But note that
the lockout only lasts through step B of the VACUUM, not the whole thing.)


There you have it.  If people like this, I'm prepared to commit to
making it happen for 7.2.  Comments, objections, better ideas?
        regards, tom lane

