btree index bloat, prototype non-locking solution - Mailing list pgsql-general
From:    Dave Chapeskie
Subject: btree index bloat, prototype non-locking solution
Date:
Msg-id:  20050329191017.GA97659@ddm.wox.org
List:    pgsql-general
The issue commonly referred to as 'btree index bloat' is better since the 7.4 release but still exists. What follows is a description of the problem (for the non-developers or those that have forgotten), how it can occur in real tables, the current work-arounds (REINDEX), and a possible solution that doesn't hold long-term exclusive locks. Skip to the prototype solution section if you like. I'd like some comments from developers.

[Please CC me on any replies; I'll check the archives but I'm not currently subscribed to this list, thanks.]

Problem Description
===================

The bloating issue is that when tuples are deleted, index pages may become sparsely populated, and under some circumstances this can affect the performance of the index.

PostgreSQL 7.4 addressed part of this problem by introducing the ability of VACUUM to identify completely free index pages and mark them for reuse. Before 7.4 the possible size of an index was completely unbounded; under extreme circumstances you could generate an arbitrarily large index file from a very small set of rows. Now the worst you can possibly get is a single index key per index page, i.e. 8 KB of index per row (still very bad, but at least it's bounded).

Deletes should not be a problem for an index on arbitrary general-use columns, since future updates/inserts for index keys in the same range can reuse the space in existing index pages. Indeed, this is why REINDEX and CREATE INDEX don't pack the index pages completely full to start with: to avoid needing to split pages right away on a following insert or update.

The problem arises if the insert and delete patterns are such that this space is never reused. It's exactly this pattern that many of the tables in the schema I work with follow, and the resulting sparse btree indexes are causing us issues on production databases.

The only current solution is to run REINDEX once the index becomes sparse; this locks both readers and writers out of the table. For non-primary-key indices, CREATE INDEX/DROP INDEX can be used instead to build a replacement index, with the benefit that readers are still allowed, but writers are still locked out. With large tables either of these operations can take hours, which makes them undesirable (or in my case unusable).

Real-Life Example
=================

We have tables that hold timestamped data. The data is inserted with the current timestamp, so this field is monotonically increasing. Additionally, there are serial fields that also monotonically increase. The granularity of the data during insert is high (e.g. 1 row per item every 5 minutes), but as the data ages the granularity can be lower (e.g. after a week we only need 1 row per item every hour; after a month we only need 1 row per item every day; etc.). The tables are designed such that the granularity can be changed by simply deleting a subset of rows (e.g. delete 11 of the 12 rows inserted for an item during an hour). Old rows are never updated.

Keys for indices starting with the timestamp or id field always get inserted into the rightmost index page. The current btree code splits this page 70/30 (instead of the normal 50/50), so after repeated inserting the index pages are ~70% full of index keys (note that a REINDEX fills index pages to 90%). As the data ages we delete a subset of the old data, which makes those pages sparse (or in some cases empty; empty pages are reused). Since no updates or inserts ever occur for old timestamp or serial values, these pages become and stay quite sparse.
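To make the pattern concrete, here is a minimal sketch of the kind of workload described above. It is illustrative only; the table and column names are made up, and this is not the attached script:

    -- A table where both the serial id and the timestamp only ever grow,
    -- so new index keys always land in the rightmost btree leaf page.
    CREATE TABLE samples (
        id    serial,
        item  integer   NOT NULL,
        ts    timestamp NOT NULL,
        value numeric
    );
    CREATE INDEX samples_ts_idx ON samples (ts, item);
    CREATE INDEX samples_id_idx ON samples (id);

    -- High-granularity inserts: one row every 5 minutes.
    INSERT INTO samples (item, ts, value)
    SELECT i % 100,
           timestamp '2005-01-01' + (i * interval '5 minutes'),
           random()
    FROM generate_series(1, 100000) AS g(i);

    -- Later, thin out the old data: keep only 1 of every 12 old rows.
    -- No future insert ever falls into this old key range, so the
    -- space freed in those index pages is never reused.
    DELETE FROM samples
    WHERE ts < timestamp '2005-02-01'
      AND id % 12 <> 0;

    VACUUM ANALYZE samples;

After the DELETE, the leaf pages covering the old timestamp range are left mostly (but not completely) empty, so VACUUM cannot recycle them; that is exactly the sparseness the rest of this message is about.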
The attached example SQL script has a case where index pages end up only about 20% full on average.

Prototype Solution
==================

The solution I've toyed with to improve performance and make the space reusable, but NOT lock out readers or writers for any length of time, is a naive btree index page merger. This does the opposite of an index page split.

The merger/compressor scans the pages of the index and compares each page with its right sibling. If the items on the two pages can fit in one page at a capacity less than some target amount (e.g. 90%), the items from the left page are moved into the right page, the parent page is updated to reflect this, and then all three pages are written out. The reason I call it naive is that it only considers pairs of pages, not runs of several pages that could be merged (e.g. compressing 3 pages into 2).

The code locks the three pages in question, but because I wasn't sure whether this would be safe in all cases it also grabs an exclusive lock on the table and index during the scan. However, unlike REINDEX, the scan can be incremental, e.g. scan pages 1-1000 and release the lock, then scan pages 1001-2000 and release the lock, etc., so that writers are only periodically and briefly locked out.

After this, a VACUUM of the table will put the now-empty index pages on the FSM (Free Space Map) for later reuse. (That is, this doesn't make the index file smaller yet; it just helps limit further growth and increase performance.)

Attached is an SQL script that generates a small table with several sparse btree indexes, which I've used to test the effectiveness of my btree compressor. Comments in the script contain the numbers I've seen, and they seem to indicate that my naive merge strategy works well.

Also attached is a prototype of the btree page compressor that can be compiled as a shared object and added as a database function. The biggest issue with it is that, as a loadable object, it can't generate WAL records (since it would need a new record type); therefore, if the database crashes before the next checkpoint the index will likely be corrupt (a REINDEX should fix this). Also, since there are no WAL records, this won't play nicely with WAL archiving for PITR in 8.0. A shell script would be used to call the function repeatedly to scan the whole index a small chunk at a time (see README.btree_compress in the attached tar file).

I'm looking for comments from the developers as to:

 * the desirability of doing such btree page merges,
 * issues with the approach I'm using to implement it,
 * whether I'm doing the locking correctly,
 * whether the exclusive lock can be removed so that the scan doesn't need to be incremental,
 * whether a patch to do this in the back-end would be welcome,

etc.

My goal is to take any comments provided and produce a patch that implements this feature in the back-end, with a new WAL record type so that it's safe across a database crash, and as an easy-to-use command (e.g. VACUUM INDEX foo, or COMPRESS INDEX foo). If the command still needs to operate incrementally, I'd like to do that all within the command (like the way VACUUM and/or ANALYZE currently get and release locks while running).

On a related note, I don't see why a version of VACUUM FULL couldn't also be implemented to work incrementally with a series of shorter-lived exclusive locks.

-- 
Dave Chapeskie
OpenPGP Key ID: 0x3D2B6B34