Thread: prefix btree implementation

prefix btree implementation

From

Qingqing Zhou

Date:

05 October 2005, 01:00:27

I am not sure if this idea was mentioned before.

The basic prefix btree idea is quite straightforward, i.e., try to
compress the key items within a data page by sharing the common prefix.
Thus the fanout of the page is increased and the benefits is obvious
theorectically.

Consider the following cases: 1/ Multi-attributes index, example page
looks like this:   [1, 1, 'abc'], [1, 1, 'bde'], ..., [1, 22, 'mmk']   - So [1, 1] or [1] is compressible

2/ Index on string, example page looks like this:   ['aaab'] ['aaac'], ..., ['aabc']   - So ['aaa'] or ['aa'] is
compressible

3/ Both 1 and 2, example page looks like this:   [1, 'abc', 1], [1, 'abc', 2], ..., [1, 'abk', 1]   - So [1, 'abc'] or
[1,'ab'] is compressible
 

4/ And even this, example page looks like this:   [0x00001234], [0x00001235], ..., [0x00002213]   - So [0x0000] is
compressible

So together, there are basically four types of possible sharing:
column-wise (case 1), character-wise (case 2), column-character-wise (case
3), and byte-wise (case 4).

To compress these prefixes, we have the following questions:

1/ What types of prefix compression shall we support?

We can consider case 3 as a generalized case of 1 and 2, thus at least we
can implement for case 1, 2 and 3 now. Case 4 could be seen as a special
case of case 2, so the framework should be scalable enough to add supports
to 4. In short, we will implement 1,2 and 3 now.

2/ When to do prefix compression and what index operations will be
affected?

To simplify things, we can do it when we do bottom-up index rebuild. We
won't try to compress the prefix when inserting new data. By this design,
I could think that the index operations affected will include btree build,
btree split (new page has to inheritant prefix information).

3/ How to represent each index tuple and where to store the common prefix?

Index tuples involved in the compression have to remember how much have
been shared and the shared prefix should be stored within the same page.
The basic idea is that we will use bit 13 of IndexTuple->t_info to
indicate that if it shares a prefix or not, and the common prefix will be
stored in the special area of an index page. The common prefix data will
have format {length|value}.  Thanks that the PageHeader->pd_special is
dyanmic, so the common prefix could have different length on each page.
Based on these changes, we will change index_getattr() and _bt_pageinit()
accordingly.

4/ WAL support?

We only affect btree build and btree split. For btree build, we don't
worry too much, since the data is already page-wise xloged. We will add
prefix information to btree split xlog record (xl_btree_split).

Comments?

Regards,
Qingqing

Re: prefix btree implementation

From

Tom Lane

Date:

05 October 2005, 03:47:55

Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> 1/ What types of prefix compression shall we support?

Given the requirement of datatype independence, this idea seems a
complete nonstarter to me...
        regards, tom lane

Re: prefix btree implementation

From

Alvaro Herrera

Date:

05 October 2005, 09:39:00

On Wed, Oct 05, 2005 at 12:50:41AM -0400, Tom Lane wrote:
> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> > 1/ What types of prefix compression shall we support?
> 
> Given the requirement of datatype independence, this idea seems a
> complete nonstarter to me...

How about having each type optionally provide the required routines?
Thus we could provide them at least for the most common datatypes, and
the system would continue working as currently for the rest (including
user-defined types).  Cross-column prefixes would be hard to handle I
guess, as well as TOASTed data.

One problem I do see is what happens if I need to insert a new tuple in
the page that doesn't share the prefix.  It obviously would have to be
the leftmost or rightmost item on the page, but it's possible.

-- 
Alvaro Herrera                                http://www.PlanetPostgreSQL.org
A male gynecologist is like an auto mechanic who never owned a car.
(Carrie Snow)

Re: prefix btree implementation

From

Bricklen Anderson

Date:

05 October 2005, 11:57:21

Qingqing Zhou wrote:
> I am not sure if this idea was mentioned before.
> 
> The basic prefix btree idea is quite straightforward, i.e., try to
> compress the key items within a data page by sharing the common prefix.
> Thus the fanout of the page is increased and the benefits is obvious
> theorectically.
> 
<snip>
> 
> So together, there are basically four types of possible sharing:
> column-wise (case 1), character-wise (case 2), column-character-wise (case
> 3), and byte-wise (case 4).
> 
Oracle implements something similar called index compression, but I believe it
is only for common column values. I haven't checked in versions>9r1 so maybe
there are other options implemented by now.

Jonathan Lewis describes some pros and cons here:
http://www.jlcomp.demon.co.uk/faq/compress_ind.html

-- 
_______________________________

This e-mail may be privileged and/or confidential, and the sender does
not waive any related rights and obligations. Any distribution, use or
copying of this e-mail or the information it contains by other than an
intended recipient is unauthorized. If you received this e-mail in
error, please advise me (by return e-mail or otherwise) immediately.
_______________________________

Re: prefix btree implementation

From

"Qingqing Zhou"

Date:

05 October 2005, 14:14:51

"Bricklen Anderson" <BAnderson@PresiNET.com> wrote
>
> Oracle implements something similar called index compression, but I 
> believe it
> is only for common column values. I haven't checked in versions>9r1 so 
> maybe
> there are other options implemented by now.
>
> Jonathan Lewis describes some pros and cons here:
> http://www.jlcomp.demon.co.uk/faq/compress_ind.html
>

Oracle 9 uses the grammar like this:
   CREATE INDEX ... [ COMPRESS <number_of_first_columns> ]

So it gives the flexibility of choosing optimal number of coulumns to the 
user. The script mentioned in the article guesses the optimal number by 
estimating the size of each choice. But I am thinking we can do it better: 
(1) we don't require that the compressed number of columns on each page are 
the same; (2) when we build up index bottom-up, we can determine this number 
for each page automatically by maximizing the number of items within a page.

Regards,
Qingqing

Re: prefix btree implementation

From

"Qingqing Zhou"

Date:

05 October 2005, 16:40:53

"Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote
> On Wed, Oct 05, 2005 at 12:50:41AM -0400, Tom Lane wrote:
>> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
>> > 1/ What types of prefix compression shall we support?
>>
>> Given the requirement of datatype independence, this idea seems a
>> complete nonstarter to me...
>
> How about having each type optionally provide the required routines?
> Thus we could provide them at least for the most common datatypes, and
> the system would continue working as currently for the rest (including
> user-defined types).  Cross-column prefixes would be hard to handle I
> guess, as well as TOASTed data.
>

Yes, column-wise should not be difficult since it does require no knowledge 
of the data types.

We can treat cross-column or incomplete-column share in two ways. One way 
(binary-comparison) is that we just compare the items binaryly, i.e., 
without knowing what in fact is stored. The other way (datatype-comparison) 
is we compare the items with some knowlege of the data types. For example, 
suppose the index is defined on a varchar column and the examplar data look 
like this:
   {3|'aaa'}{4|'aaab'}{5|'aaabc'}

The binary-comparison way can't share prefix 'aaa' at all, because it will 
see 3, 4 and 5 are totally different. The datatype-comparison way can share 
the prefix 'aaa', but it has to know that each varchar column is associated 
with a length header. When it compares, it ignores the header. The first way 
is easy to implement, and works for some data types like integers, but no 
acceptable I guess, since it even does not support varchar column prefix 
sharing.

We can find a way to handle the above case, but it is better to find a 
general way to handle any data types(include UDT). "Each type optionally 
provide the required routines" could be a way, more details?

> One problem I do see is what happens if I need to insert a new tuple in
> the page that doesn't share the prefix.  It obviously would have to be
> the leftmost or rightmost item on the page, but it's possible.
>

We do the prefix sharing when we build up index only, never on the fly.

Regards,
Qingqing

Resultset duplicates (was Re: prefix btree implementation)

From

Richard Huxton

Date:

05 October 2005, 18:00:35

Qingqing Zhou wrote:
> Oracle 9 uses the grammar like this:
> 
>     CREATE INDEX ... [ COMPRESS <number_of_first_columns> ]
> 
> So it gives the flexibility of choosing optimal number of coulumns to the 
> user. The script mentioned in the article guesses the optimal number by 
> estimating the size of each choice. But I am thinking we can do it better: 
> (1) we don't require that the compressed number of columns on each page are 
> the same; (2) when we build up index bottom-up, we can determine this number 
> for each page automatically by maximizing the number of items within a page.

Are there any gains in eliminating duplicate values in result-sets? I'd 
guess that many/most large result-sets are sorted which should make it 
possible to get away with a "same as last row" marker when the whole set 
is returned to a client.

Of course, this is where someone turns around and tells me we do this 
already :-)

--  Richard Huxton  Archonet Ltd

Re: prefix btree implementation

From

Junji TERAMOTO

Date:

05 October 2005, 23:42:53

Hello all,

I also was examining a similar compression method just.

Qingqing Zhou wrote:
> We can find a way to handle the above case, but it is better to find a 
> general way to handle any data types(include UDT). "Each type optionally 
> provide the required routines" could be a way, more details?

How about the use of difference information with "High key"?
Because "High key" information exists also in the route page, I think
that it seems to be able to use it well. (e.g. new tuple insertion)
# There is a problem that in the rightmost page, "High key" is not...

My consideration was just started, whether it goes well really has not
been understood yet.

---
Junji Teramoto

Re: prefix btree implementation

From

Simon Riggs

Date:

06 October 2005, 04:53:20

On Wed, 2005-10-05 at 00:50 -0400, Tom Lane wrote:
> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> > 1/ What types of prefix compression shall we support?
> 
> Given the requirement of datatype independence, this idea seems a
> complete nonstarter to me...

It would be possible to compress on similar values, since we know the
output of the comparison in the final stage of the sort of the index
build. That wouldn't need to rely upon anything to do with the datatype,
since "they are equal" is a fact outside the encapsulation, and is
arrived at by use of the datatype's own comparison logic.

But that isn't prefix compression, just compression.

But why do we want this? Its very easy to work out a data-aware
prefixing or compression scheme and then encapulate that in a function.
The function can then be used in a functional index and the usage hidden
behind a view.

It might be worth teaching the optimiser that if it has an index on an
immutable function that if we have WHERE x = k and a functional index on
f(x) then we can access the functional index with 
f(x) = f(k), as long as we also reapply the original WHERE clause. 

Best Regards, Simon Riggs

Re: prefix btree implementation

From

Tom Lane

Date:

06 October 2005, 10:41:34

Simon Riggs <simon@2ndquadrant.com> writes:
> It might be worth teaching the optimiser that if it has an index on an
> immutable function that if we have WHERE x = k and a functional index on
> f(x) then we can access the functional index with 
> f(x) = f(k), as long as we also reapply the original WHERE clause. 

As I just pointed out to Gaetano, this is utterly wrong.  We can't
assume that much about the behavior of equality.
        regards, tom lane

Re: prefix btree implementation

From

Martijn van Oosterhout

Date:

06 October 2005, 10:58:19

On Thu, Oct 06, 2005 at 08:53:25AM +0100, Simon Riggs wrote:
> It would be possible to compress on similar values, since we know the
> output of the comparison in the final stage of the sort of the index
> build. That wouldn't need to rely upon anything to do with the datatype,
> since "they are equal" is a fact outside the encapsulation, and is
> arrived at by use of the datatype's own comparison logic.

Well, one thing I would be curious about would be when you index a
multicolumn index, not storing the value of each column in each index
entry? Couldn't there be a btree on the first key, then at the leaf a
pointer to a new btree on the second key, etc.

It would save a lot of space in large multicolumn indexes, no? And
since ctid is an automatic member of each indexkey, it could
automatically remove common key elements.

Codingwise however, it sucks. The whole index_getattr can't happen and
you have remember more state going down. And you probably can't split
the ctid out anyway, because you might end up needing a whole page per
key value then.

Still, I know a large 7 column index where the first three columns
don't change that often and would benefit from this kind of
optimisation.

> It might be worth teaching the optimiser that if it has an index on an
> immutable function that if we have WHERE x = k and a functional index on
> f(x) then we can access the functional index with
> f(x) = f(k), as long as we also reapply the original WHERE clause.

Until we have a better idea about how much functions cost we should be
wary of this. Also, consider if you had an index on sign(x). It's
immutable sure, but useless from a planner prespective for the
optimisation you suggest. How would it know? Presumably there may be
statistics on this but still...

Have a ncie day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: prefix btree implementation

From

Simon Riggs

Date:

06 October 2005, 16:05:49

On Thu, 2005-10-06 at 09:38 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > It might be worth teaching the optimiser that if it has an index on an
> > immutable function that if we have WHERE x = k and a functional index on
> > f(x) then we can access the functional index with 
> > f(x) = f(k), as long as we also reapply the original WHERE clause. 
> 
> As I just pointed out to Gaetano, this is utterly wrong.  We can't
> assume that much about the behavior of equality.

For any function, yes, because you can always construct a function that
violates that. But it seems straightforward to introduce another
declarative form for which it is true, similar to the way that IMMUTABLE
allows us to make other assumptions at parse time. That's only worth it
if you believe that many useful functions follow that rule; I would say
that they do.

Best Regards, Simon Riggs

Re: prefix btree implementation

From

"Jim C. Nasby"

Date:

06 October 2005, 18:42:20

On Wed, Oct 05, 2005 at 03:40:43PM -0700, Qingqing Zhou wrote:
> We do the prefix sharing when we build up index only, never on the fly.

So are you saying that inserts of new data wouldn't make any use of
this? ISTM that greatly reduces the usefulness, though I'm not objecting
because compression during build is probably better than none at all. Is
there a technical reason compression can't be used during normal
operations?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: prefix btree implementation

From

Bruce Momjian

Date:

06 October 2005, 23:43:59

Jim C. Nasby wrote:
> On Wed, Oct 05, 2005 at 03:40:43PM -0700, Qingqing Zhou wrote:
> > We do the prefix sharing when we build up index only, never on the fly.
> 
> So are you saying that inserts of new data wouldn't make any use of
> this? ISTM that greatly reduces the usefulness, though I'm not objecting
> because compression during build is probably better than none at all. Is
> there a technical reason compression can't be used during normal
> operations?

Added to TODO:

* Consider compressing indexes by storing key prefix values shared by several rows as a single index entry

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: prefix btree implementation

From

Qingqing Zhou

Date:

07 October 2005, 00:19:51

On Thu, 6 Oct 2005, Jim C. Nasby wrote:

> On Wed, Oct 05, 2005 at 03:40:43PM -0700, Qingqing Zhou wrote:
> > We do the prefix sharing when we build up index only, never on the fly.
>
> So are you saying that inserts of new data wouldn't make any use of
> this? ISTM that greatly reduces the usefulness, though I'm not objecting
> because compression during build is probably better than none at all. Is
> there a technical reason compression can't be used during normal
> operations?
>

Yes, there are. Think if we do it we when build up index, we can choose
the shared prefix optimally w.r.t. maximizing the number of index items on
the page. But on the fly, if we do so, I am afraid this will (1) kill the
performance; (2) introduce more complexities. I don't exclude the
possibility of doing prefix sharing on the fly, but for current stage, I
would like to first come up with a proof-of-concept patch not including
this part.

Regards,
Qingqing

Re: prefix btree implementation

From

Simon Riggs

Date:

07 October 2005, 05:44:55

On Thu, 2005-10-06 at 22:43 -0400, Bruce Momjian wrote:
> Jim C. Nasby wrote:
> > On Wed, Oct 05, 2005 at 03:40:43PM -0700, Qingqing Zhou wrote:
> > > We do the prefix sharing when we build up index only, never on the fly.
> > 
> > So are you saying that inserts of new data wouldn't make any use of
> > this? ISTM that greatly reduces the usefulness, though I'm not objecting
> > because compression during build is probably better than none at all. Is
> > there a technical reason compression can't be used during normal
> > operations?
> 
> Added to TODO:
> 
> * Consider compressing indexes by storing key prefix values shared by
>   several rows as a single index entry

Just to re-iterate Tom's point. There isn't any easy way of doing this
in an object-relational database where we can make almost no assumptions
about particular datatypes. There is no definition of datatype prefix...
The best you could do would be to introduce a new API that allows a
datatype to provide a shorter prefix value when given a starting data
value, but then you'd need to write a whole set of prefixing functions.
But that is almost identical to the idea of functional indexes anyhow,
so I see no value in providing a second way of doing this when the
existing way can be more easily modified to do this. 

I do not think key prefixing should be on the TODO for PostgreSQL,
unless we also agree that datatype independence should not be maintained
in all cases and can be relaxed for built-in datatypes.

I suggest we reword that to "Investigate techniques for compressing
indexes that will work successfully with datatype independence".

What we might consider is having the index store a chain of pointers as
an array on the index row, rather than having each row map to one index
row. That technique would considerably reduce index volume for non-
unique indexes by removing lots of index tuple overhead. But you would
need to keep at most N pointers on a row. This technique is used by
Teradata's Non Unique Secondary Index design.

Best Regards, Simon Riggs

Re: prefix btree implementation

From

Bruce Momjian

Date:

07 October 2005, 11:04:22

OK, TODO updated:

* Consider compressing indexes by storing key values duplicated in several rows as a single index entry
 This is difficult because it requires datatype-specific knowledge.


---------------------------------------------------------------------------

Simon Riggs wrote:
> On Thu, 2005-10-06 at 22:43 -0400, Bruce Momjian wrote:
> > Jim C. Nasby wrote:
> > > On Wed, Oct 05, 2005 at 03:40:43PM -0700, Qingqing Zhou wrote:
> > > > We do the prefix sharing when we build up index only, never on the fly.
> > > 
> > > So are you saying that inserts of new data wouldn't make any use of
> > > this? ISTM that greatly reduces the usefulness, though I'm not objecting
> > > because compression during build is probably better than none at all. Is
> > > there a technical reason compression can't be used during normal
> > > operations?
> > 
> > Added to TODO:
> > 
> > * Consider compressing indexes by storing key prefix values shared by
> >   several rows as a single index entry
> 
> Just to re-iterate Tom's point. There isn't any easy way of doing this
> in an object-relational database where we can make almost no assumptions
> about particular datatypes. There is no definition of datatype prefix...
> The best you could do would be to introduce a new API that allows a
> datatype to provide a shorter prefix value when given a starting data
> value, but then you'd need to write a whole set of prefixing functions.
> But that is almost identical to the idea of functional indexes anyhow,
> so I see no value in providing a second way of doing this when the
> existing way can be more easily modified to do this. 
> 
> I do not think key prefixing should be on the TODO for PostgreSQL,
> unless we also agree that datatype independence should not be maintained
> in all cases and can be relaxed for built-in datatypes.
> 
> I suggest we reword that to "Investigate techniques for compressing
> indexes that will work successfully with datatype independence".
> 
> What we might consider is having the index store a chain of pointers as
> an array on the index row, rather than having each row map to one index
> row. That technique would considerably reduce index volume for non-
> unique indexes by removing lots of index tuple overhead. But you would
> need to keep at most N pointers on a row. This technique is used by
> Teradata's Non Unique Secondary Index design.
> 
> Best Regards, Simon Riggs
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073