Re: Index AM change proposals, redux - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Re: Index AM change proposals, redux |
Date | |
Msg-id | 1209028262.4259.1455.camel@ebony.site Whole thread Raw |
In response to | Re: Index AM change proposals, redux (Bruce Momjian <bruce@momjian.us>) |
Responses |
Re: Index AM change proposals, redux
Re: Index AM change proposals, redux Re: Index AM change proposals, redux |
List | pgsql-hackers |
On Wed, 2008-04-23 at 22:26 -0400, Bruce Momjian wrote: > Simon Riggs wrote: > > GIT significantly reduces the size of clustered indexes, greatly > > improving the number of index pointers that can be held in memory for > > very large indexes. That translates directly into a reduction in I/O for > > large databases on typical hardware, for primary operations, file > > backups and recovery (and this, log replication). Test results validated > > that and showed increased performance, over and above that experienced > > with HOT, when tested together. > > > > Now there may be problems with the GIT code as it stands, but we should > > acknowledge that the general technique has been proven to improve > > performance on a recent PostgreSQL codebase. This is an unsurprising > > result, since SQLServer, Sybase, DB2, Oracle and Teradata (at least) all > > use indexes of this category to improve real-world performance. The idea > > is definitely not a benchmark-only feature. > > > > Many users would be very interested if we could significantly reduce the > > size of the main index on their largest tables. > > Yes, basically GIT allows index compression for clustered tables, and > stated that way it is clear it would be a big feature if we had it for > 8.4. > > I would at least like to see clustered indexes acknowledged as a TODO > > item, so we keep the door open for a future implementation based around > > the basic concept of GIT. > > Yes, it is already a TODO: > > * Consider compressing indexes by storing key values duplicated in > several rows as a single index entry That sounds similar, but definitely does not describe what GIT does. GIT holds one index pointer for a *range* of values held on one block. e.g. if we have a table with values 1-200 in it then GIT would store *one* index pointer for all 200 rows, even though the values are all different. That property makes it very different from the TODO item stated, since GIT can work very well on Unique indexes, which clearly do not have duplicated keys. I'd prefer the term "Clustered Indexes" rather than the name GIT, which is also a slang insult. Index compression is possible in many ways, depending upon the situation. All of the following sound similar at a high level, but each covers a different use case. * For Long, Similar data e.g. Text we can use Prefix Compression We still store one pointer per row, but we reduce the size of the index by reducing the size of the key values. This requires us to reach inside datatypes, so isn't a very general solution but is probably an important one in the future for Text. * For Unique/nearly-Unique indexes we can use Range Compression We reduce the size of the index by holding one index pointer per range of values, thus removing both keys and pointers. It's more efficient than prefix compression and isn't datatype-dependant. * For Highly Non-Unique Data we can use Duplicate Compression The latter is the technique used by Bitmap Indexes. Efficient, but not useful for unique/nearly-unique data * Multi-Column Leading Value Compression - if you have a multi-column index, then leading columns are usually duplicated between rows inserted at the same time. Using an on-block dictionary we can remove duplicates. Only useful for multi-column indexes, possibly overlapping/contained subset of the GIT use case. As with HOT, all of these techniques need to be controlled by heuristics. Taken to extremes, these techniques can hurt other aspects of performance, so its important we don't just write the code but also investigate reasonable limits for behaviour also. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
pgsql-hackers by date: