Re: Index Tuple Compression Approach? - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Index Tuple Compression Approach?
Date
Msg-id 1187135907.24264.44.camel@dogma.ljc.laika.com
Whole thread Raw
In response to Re: Index Tuple Compression Approach?  (Decibel! <decibel@decibel.org>)
Responses Re: Index Tuple Compression Approach?
List pgsql-hackers
On Tue, 2007-08-14 at 16:27 -0500, Decibel! wrote:
> Isn't this what Grouped Index Tuples is?
> 

http://community.enterprisedb.com/git/git-readme.txt

It looks like GIT is a little different. 

GIT actually stores a lower-bound key of a contiguous* range of keys
that all point to the same page, and for each of those ranges stores a
bitmap of page offsets. A search searches first for an exact match in
the index, and failing that, looks to see if the previous index tuple
happens to be one of these ranges.

The algorithm Chris is talking about stores a set of tuple ids (which
include page and offset) for each distinct key.

Both could be helpful, although I don't think they can work together
very well.

GIT has the disadvantage that it's "lossy". It doesn't even store every
key in the index, so it can't be sure that the match actually is a
match. Thus, it returns candidate matches. That also has implications
for enforcing UNIQUE (although it's not impossible, according to the
readme). However, GIT can be used effectively on an index that happens
to be unique. GIT also assumes a tree structure, and makes no sense for
a hash index, and makes no sense for a types without ordering. GIT's
space savings is dependent on the clustering of the table.

Chris's suggestion would work on a UNIQUE index, but would be no help at
all, because there would be no duplicate keys to collapse. However, it
could be used for non-tree indexes. The space savings for this strategy
is dependent on how repetitive the keys are.

I guess the ultimate deciding factor is which can save you more space.
If you have lots of duplicates, Chris's suggestion might work better,
because you don't have to try to maintain cluster order. If you have a
wider distribution of data, GIT is probably better, although you have to
keep some degree of clustering (HOT may help with that).

Heikki is the authority on GIT, so I'm including him in the CC so he can
correct me :)

Regards,Jeff Davis

*: I'm not 100% sure I'm using "contiguous" correctly, but the range of
keys can contain gaps or duplicates, so long as every key in the range
points to that same page. That is, if keys 1,1,2,3,5 all point to page
P, they can be grouped into just "1" so long as there doesn't exist a
key 4 that points to a page other than P.



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: CVS corruption/mistagging?
Next
From: Tom Lane
Date:
Subject: Re: CVS corruption/mistagging?