Re: Index AM change proposals, redux - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Index AM change proposals, redux
Date
Msg-id 1209037950.4259.1522.camel@ebony.site
In response to Re: Index AM change proposals, redux  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers
On Thu, 2008-04-24 at 13:13 +0200, Martijn van Oosterhout wrote:
> On Thu, Apr 24, 2008 at 10:11:02AM +0100, Simon Riggs wrote:
> > Index compression is possible in many ways, depending upon the
> > situation. All of the following sound similar at a high level, but each
> > covers a different use case.
> 
> True, but there is one significant difference:

> > * For Long, Similar data, e.g. Text, we can use Prefix Compression
> > * For Highly Non-Unique Data we can use Duplicate Compression
> > * Multi-Column Leading Value Compression - if you have a multi-column
> 
> These are all non-lossy and so are candidates for use on any b-tree,
> even by default. They don't affect plan construction materially,
> except perhaps in cost calculations. Given the index tuple overhead I
> don't see how you could lose.

True, but it is important to remember that this is not a discussion
about the one best technique for index compression. There are many
techniques, covering many use cases. If we assume there is only one,
the other use cases get thrown out the door because "there was no
consensus". If the TODO list does not say clearly what we want, it will
never happen. We know no sponsor will pay for an idea that didn't make
the list. Worse, they will pay for what is on the list, even if it
wasn't exactly what we wanted.
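
To make the prefix case concrete, here is a rough sketch of the idea.
The structures and names below are made up for illustration, not the
actual index tuple layout: store the shared prefix once per leaf page
and keep only the distinct suffixes per tuple.

/*
 * Sketch of prefix compression for sorted text keys on one index
 * page (hypothetical layout, not the PostgreSQL on-disk format):
 * the shared prefix is stored once, each tuple keeps its suffix.
 */
#include <stdio.h>
#include <string.h>

#define NKEYS 4

static size_t
common_prefix_len(const char *keys[], int n)
{
    size_t len = strlen(keys[0]);

    for (int i = 1; i < n; i++)
    {
        size_t j = 0;

        while (j < len && keys[i][j] == keys[0][j])
            j++;
        len = j;
    }
    return len;
}

int
main(void)
{
    /* sorted keys as they might appear on one leaf page */
    const char *keys[NKEYS] = {
        "customer_address_line1",
        "customer_address_line2",
        "customer_email",
        "customer_phone"
    };
    size_t plen = common_prefix_len(keys, NKEYS);
    size_t before = 0;
    size_t after = plen;        /* prefix stored once */

    printf("shared prefix: %.*s\n", (int) plen, keys[0]);
    for (int i = 0; i < NKEYS; i++)
    {
        before += strlen(keys[i]);
        after += strlen(keys[i] + plen);
        printf("  suffix: %s\n", keys[i] + plen);
    }
    printf("key bytes: %zu -> %zu\n", before, after);
    return 0;
}

The same shape covers duplicate compression: store the key once and a
list of TIDs against it, rather than one index tuple per heap tuple.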

> > * For Unique/nearly-Unique indexes we can use Range Compression
> 
> This one is lossy and so does affect possible plans. I believe the
> term for this is "sparse" index. Also, it seems harder to optimise,
> since it would seem to me the index would need some idea of how
> "close" values are to each other.

Lossy, true, but we already support lossy searches with
BitmapHeapScans, so let's not be fooled by how that word sounds. That
was Tom's original complaint against Block Indexes in Sep 2006, which
is why the non-lossy GIT design was conceived. GIT knows exactly which
tuples are indexed, and which are not.

There is no need for the index to know how close the values are, only
that values in a given range live on a given block.
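
To be clear about what I mean, here is a sketch of a range-compressed
index probe. The structures are invented for illustration (this is not
the GIT or btree format): one entry per heap block records only the
lowest and highest key on it, and a probe yields candidate blocks whose
tuples must then be rechecked, just as a BitmapHeapScan recheck works
today.

/*
 * Sketch of range ("sparse") compression, with invented structures
 * rather than any real on-disk format: one index entry per heap
 * block records only the min/max key stored there.  A probe is
 * lossy: it yields candidate blocks, and every tuple on a matching
 * block must be rechecked.
 */
#include <stdio.h>

typedef struct RangeEntry
{
    int lo;                     /* smallest key on the block */
    int hi;                     /* largest key on the block */
    int block;                  /* heap block number */
} RangeEntry;

static const RangeEntry index_entries[] = {
    {1, 99, 0},
    {100, 199, 1},
    {200, 310, 2}               /* ranges needn't be uniform or dense */
};

int
main(void)
{
    int search_key = 150;
    size_t nentries = sizeof(index_entries) / sizeof(index_entries[0]);

    for (size_t i = 0; i < nentries; i++)
    {
        const RangeEntry *e = &index_entries[i];

        /* lossy match: the key *may* be on this block */
        if (search_key >= e->lo && search_key <= e->hi)
            printf("key %d may be on block %d: recheck its tuples\n",
                   search_key, e->block);
    }
    return 0;
}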

> > As with HOT, all of these techniques need to be controlled by
> > heuristics. Taken to extremes, these techniques can hurt other
> > aspects of performance, so it's important we don't just write the
> > code but also investigate reasonable limits for behaviour.
> 
> For the first three I can't imagine what the costs would be, except
> the memory usage to store the unduplicated data once when it's read
> in.
> 
> Have a nice day,
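
To illustrate the kind of limit I mean (the numbers below are made up):
compression should only kick in when the bytes it saves beat the
bookkeeping it adds.

/*
 * Sketch of a compression heuristic: apply prefix compression only
 * when the bytes saved exceed the per-page bookkeeping overhead.
 * Threshold and overhead figures are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

static bool
worth_compressing(size_t prefix_len, int ntuples, size_t overhead)
{
    /* the prefix is stored once instead of ntuples times */
    size_t saved = prefix_len * (ntuples - 1);

    return saved > overhead;
}

int
main(void)
{
    /* a 9-byte prefix shared by only 2 tuples barely pays for itself */
    printf("%d\n", worth_compressing(9, 2, 8));     /* 1: saves 9 > 8 */
    printf("%d\n", worth_compressing(2, 3, 8));     /* 0: saves 4 <= 8 */
    return 0;
}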
-- 
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com


