Thread: prefix btree implementation
I am not sure if this idea was mentioned before. The basic prefix btree idea is quite straightforward, i.e., try to compress the key items within a data page by sharing the common prefix. Thus the fanout of the page is increased and the benefits is obvious theorectically. Consider the following cases: 1/ Multi-attributes index, example page looks like this: [1, 1, 'abc'], [1, 1, 'bde'], ..., [1, 22, 'mmk'] - So [1, 1] or [1] is compressible 2/ Index on string, example page looks like this: ['aaab'] ['aaac'], ..., ['aabc'] - So ['aaa'] or ['aa'] is compressible 3/ Both 1 and 2, example page looks like this: [1, 'abc', 1], [1, 'abc', 2], ..., [1, 'abk', 1] - So [1, 'abc'] or [1,'ab'] is compressible 4/ And even this, example page looks like this: [0x00001234], [0x00001235], ..., [0x00002213] - So [0x0000] is compressible So together, there are basically four types of possible sharing: column-wise (case 1), character-wise (case 2), column-character-wise (case 3), and byte-wise (case 4). To compress these prefixes, we have the following questions: 1/ What types of prefix compression shall we support? We can consider case 3 as a generalized case of 1 and 2, thus at least we can implement for case 1, 2 and 3 now. Case 4 could be seen as a special case of case 2, so the framework should be scalable enough to add supports to 4. In short, we will implement 1,2 and 3 now. 2/ When to do prefix compression and what index operations will be affected? To simplify things, we can do it when we do bottom-up index rebuild. We won't try to compress the prefix when inserting new data. By this design, I could think that the index operations affected will include btree build, btree split (new page has to inheritant prefix information). 3/ How to represent each index tuple and where to store the common prefix? Index tuples involved in the compression have to remember how much have been shared and the shared prefix should be stored within the same page. The basic idea is that we will use bit 13 of IndexTuple->t_info to indicate that if it shares a prefix or not, and the common prefix will be stored in the special area of an index page. The common prefix data will have format {length|value}. Thanks that the PageHeader->pd_special is dyanmic, so the common prefix could have different length on each page. Based on these changes, we will change index_getattr() and _bt_pageinit() accordingly. 4/ WAL support? We only affect btree build and btree split. For btree build, we don't worry too much, since the data is already page-wise xloged. We will add prefix information to btree split xlog record (xl_btree_split). Comments? Regards, Qingqing
Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > 1/ What types of prefix compression shall we support? Given the requirement of datatype independence, this idea seems a complete nonstarter to me... regards, tom lane
On Wed, Oct 05, 2005 at 12:50:41AM -0400, Tom Lane wrote: > Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > > 1/ What types of prefix compression shall we support? > > Given the requirement of datatype independence, this idea seems a > complete nonstarter to me... How about having each type optionally provide the required routines? Thus we could provide them at least for the most common datatypes, and the system would continue working as currently for the rest (including user-defined types). Cross-column prefixes would be hard to handle I guess, as well as TOASTed data. One problem I do see is what happens if I need to insert a new tuple in the page that doesn't share the prefix. It obviously would have to be the leftmost or rightmost item on the page, but it's possible. -- Alvaro Herrera http://www.PlanetPostgreSQL.org A male gynecologist is like an auto mechanic who never owned a car. (Carrie Snow)
Qingqing Zhou wrote: > I am not sure if this idea was mentioned before. > > The basic prefix btree idea is quite straightforward, i.e., try to > compress the key items within a data page by sharing the common prefix. > Thus the fanout of the page is increased and the benefits is obvious > theorectically. > <snip> > > So together, there are basically four types of possible sharing: > column-wise (case 1), character-wise (case 2), column-character-wise (case > 3), and byte-wise (case 4). > Oracle implements something similar called index compression, but I believe it is only for common column values. I haven't checked in versions>9r1 so maybe there are other options implemented by now. Jonathan Lewis describes some pros and cons here: http://www.jlcomp.demon.co.uk/faq/compress_ind.html -- _______________________________ This e-mail may be privileged and/or confidential, and the sender does not waive any related rights and obligations. Any distribution, use or copying of this e-mail or the information it contains by other than an intended recipient is unauthorized. If you received this e-mail in error, please advise me (by return e-mail or otherwise) immediately. _______________________________
"Bricklen Anderson" <BAnderson@PresiNET.com> wrote > > Oracle implements something similar called index compression, but I > believe it > is only for common column values. I haven't checked in versions>9r1 so > maybe > there are other options implemented by now. > > Jonathan Lewis describes some pros and cons here: > http://www.jlcomp.demon.co.uk/faq/compress_ind.html > Oracle 9 uses the grammar like this: CREATE INDEX ... [ COMPRESS <number_of_first_columns> ] So it gives the flexibility of choosing optimal number of coulumns to the user. The script mentioned in the article guesses the optimal number by estimating the size of each choice. But I am thinking we can do it better: (1) we don't require that the compressed number of columns on each page are the same; (2) when we build up index bottom-up, we can determine this number for each page automatically by maximizing the number of items within a page. Regards, Qingqing
"Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote > On Wed, Oct 05, 2005 at 12:50:41AM -0400, Tom Lane wrote: >> Qingqing Zhou <zhouqq@cs.toronto.edu> writes: >> > 1/ What types of prefix compression shall we support? >> >> Given the requirement of datatype independence, this idea seems a >> complete nonstarter to me... > > How about having each type optionally provide the required routines? > Thus we could provide them at least for the most common datatypes, and > the system would continue working as currently for the rest (including > user-defined types). Cross-column prefixes would be hard to handle I > guess, as well as TOASTed data. > Yes, column-wise should not be difficult since it does require no knowledge of the data types. We can treat cross-column or incomplete-column share in two ways. One way (binary-comparison) is that we just compare the items binaryly, i.e., without knowing what in fact is stored. The other way (datatype-comparison) is we compare the items with some knowlege of the data types. For example, suppose the index is defined on a varchar column and the examplar data look like this: {3|'aaa'}{4|'aaab'}{5|'aaabc'} The binary-comparison way can't share prefix 'aaa' at all, because it will see 3, 4 and 5 are totally different. The datatype-comparison way can share the prefix 'aaa', but it has to know that each varchar column is associated with a length header. When it compares, it ignores the header. The first way is easy to implement, and works for some data types like integers, but no acceptable I guess, since it even does not support varchar column prefix sharing. We can find a way to handle the above case, but it is better to find a general way to handle any data types(include UDT). "Each type optionally provide the required routines" could be a way, more details? > One problem I do see is what happens if I need to insert a new tuple in > the page that doesn't share the prefix. It obviously would have to be > the leftmost or rightmost item on the page, but it's possible. > We do the prefix sharing when we build up index only, never on the fly. Regards, Qingqing
Qingqing Zhou wrote: > Oracle 9 uses the grammar like this: > > CREATE INDEX ... [ COMPRESS <number_of_first_columns> ] > > So it gives the flexibility of choosing optimal number of coulumns to the > user. The script mentioned in the article guesses the optimal number by > estimating the size of each choice. But I am thinking we can do it better: > (1) we don't require that the compressed number of columns on each page are > the same; (2) when we build up index bottom-up, we can determine this number > for each page automatically by maximizing the number of items within a page. Are there any gains in eliminating duplicate values in result-sets? I'd guess that many/most large result-sets are sorted which should make it possible to get away with a "same as last row" marker when the whole set is returned to a client. Of course, this is where someone turns around and tells me we do this already :-) -- Richard Huxton Archonet Ltd
Hello all, I also was examining a similar compression method just. Qingqing Zhou wrote: > We can find a way to handle the above case, but it is better to find a > general way to handle any data types(include UDT). "Each type optionally > provide the required routines" could be a way, more details? How about the use of difference information with "High key"? Because "High key" information exists also in the route page, I think that it seems to be able to use it well. (e.g. new tuple insertion) # There is a problem that in the rightmost page, "High key" is not... My consideration was just started, whether it goes well really has not been understood yet. --- Junji Teramoto
On Wed, 2005-10-05 at 00:50 -0400, Tom Lane wrote: > Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > > 1/ What types of prefix compression shall we support? > > Given the requirement of datatype independence, this idea seems a > complete nonstarter to me... It would be possible to compress on similar values, since we know the output of the comparison in the final stage of the sort of the index build. That wouldn't need to rely upon anything to do with the datatype, since "they are equal" is a fact outside the encapsulation, and is arrived at by use of the datatype's own comparison logic. But that isn't prefix compression, just compression. But why do we want this? Its very easy to work out a data-aware prefixing or compression scheme and then encapulate that in a function. The function can then be used in a functional index and the usage hidden behind a view. It might be worth teaching the optimiser that if it has an index on an immutable function that if we have WHERE x = k and a functional index on f(x) then we can access the functional index with f(x) = f(k), as long as we also reapply the original WHERE clause. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > It might be worth teaching the optimiser that if it has an index on an > immutable function that if we have WHERE x = k and a functional index on > f(x) then we can access the functional index with > f(x) = f(k), as long as we also reapply the original WHERE clause. As I just pointed out to Gaetano, this is utterly wrong. We can't assume that much about the behavior of equality. regards, tom lane
On Thu, Oct 06, 2005 at 08:53:25AM +0100, Simon Riggs wrote: > It would be possible to compress on similar values, since we know the > output of the comparison in the final stage of the sort of the index > build. That wouldn't need to rely upon anything to do with the datatype, > since "they are equal" is a fact outside the encapsulation, and is > arrived at by use of the datatype's own comparison logic. Well, one thing I would be curious about would be when you index a multicolumn index, not storing the value of each column in each index entry? Couldn't there be a btree on the first key, then at the leaf a pointer to a new btree on the second key, etc. It would save a lot of space in large multicolumn indexes, no? And since ctid is an automatic member of each indexkey, it could automatically remove common key elements. Codingwise however, it sucks. The whole index_getattr can't happen and you have remember more state going down. And you probably can't split the ctid out anyway, because you might end up needing a whole page per key value then. Still, I know a large 7 column index where the first three columns don't change that often and would benefit from this kind of optimisation. > It might be worth teaching the optimiser that if it has an index on an > immutable function that if we have WHERE x = k and a functional index on > f(x) then we can access the functional index with > f(x) = f(k), as long as we also reapply the original WHERE clause. Until we have a better idea about how much functions cost we should be wary of this. Also, consider if you had an index on sign(x). It's immutable sure, but useless from a planner prespective for the optimisation you suggest. How would it know? Presumably there may be statistics on this but still... Have a ncie day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
On Thu, 2005-10-06 at 09:38 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > It might be worth teaching the optimiser that if it has an index on an > > immutable function that if we have WHERE x = k and a functional index on > > f(x) then we can access the functional index with > > f(x) = f(k), as long as we also reapply the original WHERE clause. > > As I just pointed out to Gaetano, this is utterly wrong. We can't > assume that much about the behavior of equality. For any function, yes, because you can always construct a function that violates that. But it seems straightforward to introduce another declarative form for which it is true, similar to the way that IMMUTABLE allows us to make other assumptions at parse time. That's only worth it if you believe that many useful functions follow that rule; I would say that they do. Best Regards, Simon Riggs
On Wed, Oct 05, 2005 at 03:40:43PM -0700, Qingqing Zhou wrote: > We do the prefix sharing when we build up index only, never on the fly. So are you saying that inserts of new data wouldn't make any use of this? ISTM that greatly reduces the usefulness, though I'm not objecting because compression during build is probably better than none at all. Is there a technical reason compression can't be used during normal operations? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: > On Wed, Oct 05, 2005 at 03:40:43PM -0700, Qingqing Zhou wrote: > > We do the prefix sharing when we build up index only, never on the fly. > > So are you saying that inserts of new data wouldn't make any use of > this? ISTM that greatly reduces the usefulness, though I'm not objecting > because compression during build is probably better than none at all. Is > there a technical reason compression can't be used during normal > operations? Added to TODO: * Consider compressing indexes by storing key prefix values shared by several rows as a single index entry -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Thu, 6 Oct 2005, Jim C. Nasby wrote: > On Wed, Oct 05, 2005 at 03:40:43PM -0700, Qingqing Zhou wrote: > > We do the prefix sharing when we build up index only, never on the fly. > > So are you saying that inserts of new data wouldn't make any use of > this? ISTM that greatly reduces the usefulness, though I'm not objecting > because compression during build is probably better than none at all. Is > there a technical reason compression can't be used during normal > operations? > Yes, there are. Think if we do it we when build up index, we can choose the shared prefix optimally w.r.t. maximizing the number of index items on the page. But on the fly, if we do so, I am afraid this will (1) kill the performance; (2) introduce more complexities. I don't exclude the possibility of doing prefix sharing on the fly, but for current stage, I would like to first come up with a proof-of-concept patch not including this part. Regards, Qingqing
On Thu, 2005-10-06 at 22:43 -0400, Bruce Momjian wrote: > Jim C. Nasby wrote: > > On Wed, Oct 05, 2005 at 03:40:43PM -0700, Qingqing Zhou wrote: > > > We do the prefix sharing when we build up index only, never on the fly. > > > > So are you saying that inserts of new data wouldn't make any use of > > this? ISTM that greatly reduces the usefulness, though I'm not objecting > > because compression during build is probably better than none at all. Is > > there a technical reason compression can't be used during normal > > operations? > > Added to TODO: > > * Consider compressing indexes by storing key prefix values shared by > several rows as a single index entry Just to re-iterate Tom's point. There isn't any easy way of doing this in an object-relational database where we can make almost no assumptions about particular datatypes. There is no definition of datatype prefix... The best you could do would be to introduce a new API that allows a datatype to provide a shorter prefix value when given a starting data value, but then you'd need to write a whole set of prefixing functions. But that is almost identical to the idea of functional indexes anyhow, so I see no value in providing a second way of doing this when the existing way can be more easily modified to do this. I do not think key prefixing should be on the TODO for PostgreSQL, unless we also agree that datatype independence should not be maintained in all cases and can be relaxed for built-in datatypes. I suggest we reword that to "Investigate techniques for compressing indexes that will work successfully with datatype independence". What we might consider is having the index store a chain of pointers as an array on the index row, rather than having each row map to one index row. That technique would considerably reduce index volume for non- unique indexes by removing lots of index tuple overhead. But you would need to keep at most N pointers on a row. This technique is used by Teradata's Non Unique Secondary Index design. Best Regards, Simon Riggs
OK, TODO updated: * Consider compressing indexes by storing key values duplicated in several rows as a single index entry This is difficult because it requires datatype-specific knowledge. --------------------------------------------------------------------------- Simon Riggs wrote: > On Thu, 2005-10-06 at 22:43 -0400, Bruce Momjian wrote: > > Jim C. Nasby wrote: > > > On Wed, Oct 05, 2005 at 03:40:43PM -0700, Qingqing Zhou wrote: > > > > We do the prefix sharing when we build up index only, never on the fly. > > > > > > So are you saying that inserts of new data wouldn't make any use of > > > this? ISTM that greatly reduces the usefulness, though I'm not objecting > > > because compression during build is probably better than none at all. Is > > > there a technical reason compression can't be used during normal > > > operations? > > > > Added to TODO: > > > > * Consider compressing indexes by storing key prefix values shared by > > several rows as a single index entry > > Just to re-iterate Tom's point. There isn't any easy way of doing this > in an object-relational database where we can make almost no assumptions > about particular datatypes. There is no definition of datatype prefix... > The best you could do would be to introduce a new API that allows a > datatype to provide a shorter prefix value when given a starting data > value, but then you'd need to write a whole set of prefixing functions. > But that is almost identical to the idea of functional indexes anyhow, > so I see no value in providing a second way of doing this when the > existing way can be more easily modified to do this. > > I do not think key prefixing should be on the TODO for PostgreSQL, > unless we also agree that datatype independence should not be maintained > in all cases and can be relaxed for built-in datatypes. > > I suggest we reword that to "Investigate techniques for compressing > indexes that will work successfully with datatype independence". > > What we might consider is having the index store a chain of pointers as > an array on the index row, rather than having each row map to one index > row. That technique would considerably reduce index volume for non- > unique indexes by removing lots of index tuple overhead. But you would > need to keep at most N pointers on a row. This technique is used by > Teradata's Non Unique Secondary Index design. > > Best Regards, Simon Riggs > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073