The issue commonly referred to as 'btree index bloat' is better after
the release of 7.4 but still exists. What follows is a description
of the problem (for non-developers, or those who have forgotten),
how it can occur in real tables, current work-arounds (REINDEX), and a
possible solution that doesn't hold long-term exclusive locks. Skip to
the Prototype Solution section if you like. I'd like some comments
from developers.
[Please CC me on any replies, I'll check the archives but I'm not
currently subscribed to this list, thanks.]
Problem Description
===================
The bloating issue is that when tuples are deleted, index pages may
become sparsely populated, and under some circumstances this can affect
the performance of the index. PostgreSQL 7.4 addressed part of this
problem by introducing the ability of VACUUM to identify completely
empty index pages and mark them for reuse. Before 7.4 the possible
size of an index was completely unbounded; under extreme circumstances
you could generate an arbitrarily large index file from a very small
set of rows. Now the worst you can possibly get is a single index key
per index page, i.e. 8 KB of index per row (still very bad, but at
least it's bounded now).
Deletes should not be a problem for an index of some arbitrary
general-use columns since future updates/inserts for index keys in the
same range can re-use the space in existing index pages. Indeed, this
is why REINDEX and CREATE INDEX don't pack the index pages full to
start with, to avoid needing to split pages right away on a following
insert or update. The problem arises if the insert and delete patterns
are such that this space is never reused. It's exactly this pattern
that many of the tables in the schema I work with follow, and the
resulting sparse btree indexes are causing us issues on production
databases.
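A minimal sketch of the never-reused pattern (the table and index
names here are mine, not from the attached script; generate_series()
is new in 8.0, any bulk load of increasing keys will do):

    CREATE TABLE t (id serial PRIMARY KEY, val integer);
    INSERT INTO t (val) SELECT generate_series(1, 100000);
    -- Delete most of the old keys.  New inserts continue at the right
    -- edge of the key range, so the sparse pages are never refilled.
    DELETE FROM t WHERE id % 10 <> 0;
    VACUUM t;
    -- relpages stays large relative to reltuples: the index is sparse.
    SELECT relname, relpages, reltuples
      FROM pg_class WHERE relname IN ('t', 't_pkey');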
The only current remedy is to run REINDEX once the index becomes
sparse; this locks out both readers and writers of the table. For
non-primary-key indices, CREATE INDEX followed by DROP INDEX can be
used to build a replacement instead, with the benefit that this still
allows readers, but it still locks out writers. With large tables
either of these operations can take hours, which makes them
undesirable (or in my case unusable).
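For a non-primary-key index the swap looks roughly like this (names
are illustrative); the CREATE INDEX still blocks writers for the
duration of the build:

    -- Build a fresh, densely packed replacement, then drop the bloated
    -- original.  Readers can keep using the old index during the
    -- build; writers are locked out until it finishes.
    CREATE INDEX foo_ts_idx_new ON foo (ts);
    DROP INDEX foo_ts_idx;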
Real-Life Example
=================
We have tables that hold timestamped data. The data is inserted
with the current time stamp, so this field is monotonically
increasing. Additionally, there are serial fields that also
monotonically increase. The granularity of the data at insert time is
high (e.g. 1 row per item every 5 minutes), but as the data ages the
required granularity drops (e.g. after a week we only need 1 row per
item every hour; after a month we only need 1 row per item every day;
etc.). The tables are designed so that the granularity can be reduced
by simply deleting a subset of rows (e.g. deleting 11 of the 12 rows
inserted for an item during an hour). Old rows are never updated.
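A stripped-down sketch of such a table and a thinning delete (the real
schema differs; this just shows the shape of it):

    CREATE TABLE samples (
        id    serial,                             -- monotonically increasing
        ts    timestamptz NOT NULL DEFAULT now(), -- monotonically increasing
        item  integer NOT NULL,
        value real
    );
    CREATE INDEX samples_ts_item_idx ON samples (ts, item);

    -- Thin week-old data to 1 row per item per hour: with rows arriving
    -- every 5 minutes this keeps the row in the first 5-minute slot of
    -- each hour and deletes the other 11.
    DELETE FROM samples
     WHERE ts < now() - interval '7 days'
       AND date_part('minute', ts) >= 5;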
Keys for indices starting with the time stamp or id field always get
inserted into the rightmost index page. The current btree code splits
this page 70/30 (instead of the normal 50/50) so that, after repeated
inserting, the index pages are ~70% full of index keys (note that a
REINDEX fills index pages to 90%). As the data ages we delete a subset
of the old data, which makes those pages sparse (or in some cases
empty; the empty ones are reused). Since no updates or inserts occur
on old time stamp or serial values, these pages become and stay quite
sparse. The attached example SQL script has a case where index pages
end up only 20% full on average.
Prototype Solution
==================
The solution I've toyed with to improve performance and make the space
re-usable, but NOT lock out readers or writers for any length of time,
is a naive btree index page merger. This does the opposite of an index
page split. The merger/compressor scans the pages of the index and
compares each page with its right sibling. If the items on the two
pages can fit in one page at a capacity less than some target amount
(e.g. with a 90% target, pages that are 40% and 45% full merge into a
single 85%-full page), the items from the left page are moved into the
right page, the parent page is updated to reflect this, and then all
three pages are written out.
The reason I call it naive is that it only considers pairs of pages,
not runs of several pages that could be merged (e.g. compressing 3
pages into 2). The code locks the three pages in question, but because
I wasn't sure this would be safe in all cases it also grabs an
exclusive lock on the table and index during the scan. However, unlike
REINDEX, the scan can be incremental: scan pages 1-1000 and release
the lock, then scan pages 1001-2000 and release the lock, etc., so
that writers are only periodically and briefly locked out.
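Driven from SQL, an incremental run might look like the following (the
function name and block-range arguments here are hypothetical; see the
README for the prototype's real interface):

    -- Hypothetical call: merge candidate pages within the given block
    -- range, taking and releasing the exclusive lock inside each call.
    SELECT btree_compress('samples_ts_item_idx', 1, 1000);
    SELECT btree_compress('samples_ts_item_idx', 1001, 2000);
    -- ...and so on, until the last block of the index is reached.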
After this, a VACUUM of the table will put the now-empty index pages
in the FSM (Free Space Map) for later re-use. (I.e. this doesn't make
the index file any smaller yet; it just helps limit further growth and
improve performance.)
Attached is an SQL script that generates a small table with several
sparse btree indexes that I've used to test the effectiveness of my
btree compressor. Comments in the script record the numbers I've seen,
which seem to indicate that the naive merge strategy works well.
Also attached is a prototype of the btree page compressor that can be
compiled as a shared object and added as a database function. The
biggest issue with it is that, as a loadable object, it can't generate
WAL records (it would need a new record type), so if the database
crashes before the next checkpoint the index will likely be corrupt (a
REINDEX should fix this). Also, since nothing is WAL-logged, this
won't play nicely with WAL archiving for PITR in 8.0. A shell script
would be used to call the function repeatedly to scan the whole index
a small chunk at a time (see README.btree_compress in the attached tar
file).
I'm looking for comments from the developers as to
* the desirability of doing such btree page merges,
* issues with the approach I'm using to implement it,
* if I'm doing the locking correctly,
* if the exclusive lock can be removed so that it doesn't need to
be incremental,
* if a patch to do this in the back-end would be welcome,
etc.
My goal is to take any comments provided and produce a patch that
implements this feature in the back-end with a new WAL record, so that
it's safe across a database crash, and exposes it as an easy-to-use
command (e.g. VACUUM INDEX foo, or COMPRESS INDEX foo). If the command
still needs to operate incrementally I'd like to handle that entirely
within the command (the way VACUUM and/or ANALYZE currently acquire
and release locks while running).
On a related note, I don't see why a version of VACUUM FULL couldn't
be implemented to work incrementally with a series of short-lived
exclusive locks as well.
--
Dave Chapeskie
OpenPGP Key ID: 0x3D2B6B34