# Re: [GENERAL] Creation of tsearch2 index is very slow - Mailing list pgsql-performance

From | Ron |
---|---|
Subject | Re: [GENERAL] Creation of tsearch2 index is very slow |
Date | |
Msg-id | 7.0.1.0.2.20060126232407.03b2f418@earthlink.net |
In response to | Re: [GENERAL] Creation of tsearch2 index is very slow (Martijn van Oosterhout) |
List | pgsql-performance |

At 11:13 PM 1/26/2006, Craig A. James wrote:

>Ron,
>
>I'll write to you privately, because these discussions can get messy
>"in public".

I'm responding to this missive publicly in an attempt to help the discussion along. It is not my usual practice to respond to private messages publicly, but this seems a good case for an exception.

>>You seem to have missed my point. I just gave a very clear
>>description of how to "decide which bitmaps go in each of the two
>>buckets" by reformulating the question into "decide which bitmaps
>>go in each of =four= buckets".
>
>Sorry to disagree, but here's the problem. It's not whether you put
>them into two, or four, or N buckets. The problem is, how do you
>categorize them to begin with, so that you have some reasonable
>criterion for which item goes in which bucket? THAT is the hard
>problem, not whether you have two or four buckets.

Agreed. ...and I've given the answer to "how do you categorize them" using a general property of RD trees, one that should yield "a reasonable criterion for which item goes in which bucket" when used for text searching.

The definition of RD tree keys as either "atomic" (non-decomposable) or "molecular" (containing the keys of their descendants) is the one source of our current problems in figuring out how to split them and, if I'm correct, a hint as to how to solve the splitting problem in O(1) time while helping to foster high performance during searches.

>Earlier, you wrote:
>>Traditional B-tree ordering methods can be used to define the
>>ordering/placement of pages within each index, which will minimize
>>head seeks to find the correct page to scan.
>>Since the criteria for putting a key within a page or starting a
>>new page is simple, performance for those tasks should be O(1).
>
>What are the "traditional B-tree ordering methods"? That's the
>issue, right there. The problem is that bitmaps HAVE NO ORDERING
>METHOD. You can't sort them numerically or alphabetically.
The =bitmaps= have no ordering method. =Pages= of bitmaps MUST have an ordering method, or we have no idea which page to look at when searching for a key. Treating the "root" bitmaps (those that may have descendants but have no parents) as numbers, and ordering the pages using B-tree creation methods with those numbers as keys, is a simple way to create a balanced data structure with high fan-out. IOW, a recipe for finding the proper page(s) to scan using minimal seeks.

>Take a crowd of people. It's easy to divide them in half by names:
>just alphabetize and split the list in half. That's what your
>solution would improve on.
>
>But imagine I gave you a group of people and told you to put them
>into two rooms, such that when you are through, the people in each
>room are maximally similar to each other and maximally dissimilar to
>the people in the other room. How would you do it?
>
>First of all, you'd have to define what "different" and "same"
>are. Is your measure based on skin color, hair color, age, height,
>weight, girth, intelligence, speed, endurance, body odor, ...?
>Suppose I tell you, "All of those". You have to sort the people
>so that your two groups are separated such that the group
>differences are maximized in this N-dimensional
>space. Computationally, it's a nearly impossible problem.

I'm =changing= the problem using the semantics of RD trees. Using an RD tree representation, we'd create and assign a key for each person that ranked them against everyone else on each of the metrics we decided to differentiate on. Then we start to form trees of keys to these people by creating "parent" keys as roots that contain the union of everyone with the same, or possibly similar, value for some quality. By iterating this process, we end up with a bottom-up construction method for an RD tree whose root key will be the union of all the keys representing these people.
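As a toy sketch of this bottom-up construction (mine, not part of the thread or of PostgreSQL's code; the `Node` and `build_parent` names are illustrative, and Python ints stand in for the bitmap keys):

```python
# Bottom-up RD-tree construction: a "molecular" parent key is the
# union (bitwise OR) of its children's keys, so the root key contains
# the bit of every leaf beneath it.

class Node:
    def __init__(self, key, children=()):
        self.key = key              # bitmap key: an int used as a bit-vector
        self.children = list(children)

def build_parent(children):
    """A molecular key is the union of its children's atomic/molecular keys."""
    key = 0
    for c in children:
        key |= c.key
    return Node(key, children)

# Leaf ("atomic") keys: one bit per person/quality pairing.
leaves = [Node(1 << i) for i in range(8)]

# Group leaves that share a quality, then iterate until a single root
# whose key is the union of every descendant's key.
level = [build_parent(leaves[i:i + 2]) for i in range(0, 8, 2)]
while len(level) > 1:
    level = [build_parent(level[i:i + 2]) for i in range(0, len(level), 2)]
root = level[0]

assert root.key == 0b11111111   # root key is the union of all leaf keys
# Containment check during a search: can bit 5 occur below this root?
print(bool(root.key & (1 << 5)))   # True
```

Because a parent's key contains its descendants' keys, a search can prune an entire subtree the moment the bitwise AND with the parent key comes up empty.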
If this is one data structure, we end up with an efficient and effective way of answering not only Boolean but also ranking- and similarity-type queries. The problem comes when we have to split this monolithic DS into pieces for best performance. As many have noted, there is no simple way to decide how to do such a thing.

OTOH, we =do= have the knowledge of how RD trees are built and what their keys represent, and we know that queries are going to tend strongly to either a) traverse the path from parent to child (depth first), b) find all siblings with (dis)similar characteristics (breadth first), or c) contain a mixture of a) and b).

So I'm suggesting that conceptually we clone the original RD tree and split each clone according to a different method. Method A results in pages that contain as many complete depth-first paths from root to leaf as possible on each page. Method B results in pages that contain as many siblings as possible per page. ...and we use the appropriate index during each type of query or query part. In return for using 2x as much space, we should have a general method that is O(1) for decomposing RD trees in such a way as to support high performance during searches.

>Here's a much simpler version of the problem. Suppose I give you
>1000 numbers, and tell you to divide them in half so that the two
>groups have the smallest standard deviation within each group
>possible, and the two average values of the groups of numbers are
>the farthest apart possible. Pretty easy, right?
>
>Now do it where "distance" is evaluated modulo(10), that is, 1 and 9
>are closer together than 1 and 3. Now try it -- you'll find that
>this problem is almost impossible to solve.

This is orthogonal to the discussion at hand, since the above is neither akin to text searching nor best done with RD trees, and that is exactly what this discussion is about. We don't have to solve a general problem for all domains.
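Going back to the two-layout proposal above, here is a naive sketch of Methods A and B (my illustration, not PostgreSQL or GiST code; a "page" is just a capacity-limited list of node ids, and the packing functions only approximate the two layouts):

```python
# Method A: fill each page along depth-first root-to-leaf paths, so a
# path-following query touches few pages.
# Method B: fill each page with runs of same-level siblings, so a
# breadth-first similarity query touches few pages.

PAGE_SIZE = 3

tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"],
        "a1": [], "a2": [], "b1": [], "b2": []}

def pack_depth_first(tree, root):
    """Method A (approximate): pack nodes in depth-first visit order."""
    pages, page = [], []
    def visit(n):
        nonlocal page
        if len(page) == PAGE_SIZE:
            pages.append(page)
            page = []
        page.append(n)
        for c in tree[n]:
            visit(c)
    visit(root)
    if page:
        pages.append(page)
    return pages

def pack_breadth_first(tree, root):
    """Method B (approximate): pack nodes level by level, siblings adjacent."""
    pages, page, frontier = [], [], [root]
    while frontier:
        nxt = []
        for n in frontier:
            if len(page) == PAGE_SIZE:
                pages.append(page)
                page = []
            page.append(n)
            nxt.extend(tree[n])
        frontier = nxt
    if page:
        pages.append(page)
    return pages

print(pack_depth_first(tree, "root"))
# [['root', 'a', 'a1'], ['a2', 'b', 'b1'], ['b2']]
print(pack_breadth_first(tree, "root"))
# [['root', 'a', 'b'], ['a1', 'a2', 'b1'], ['b2']]
```

Note how the same seven nodes land on pages differently: Method A keeps the path root→a→a1 on one page, while Method B keeps the sibling run a1, a2, b1 on one page, which is the trade the 2x space cost buys.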
We only have to solve it for the specific domain of text search using the specific DS of RD trees.

>The problem is that, as database designers, we're used to text and
>numbers, which have an inherent order. Bitmaps have no ordering --
>you can't say one is "greater than" or "less than" the other. All
>you can do is determine that A and B are "more similar" or "less
>similar" than A and C.

Text and numbers only have an order because we deem them to. There is even a very common default order we tend to use. But even in the trivial case of ranking, we've all said that "1" is better than "2" in one situation and "2" is better than "1" in another.

In the case of text searching, the bitmaps represent where to find legomena, AKA specific tokens. Particularly hapax legomena, AKA unique tokens. Hapax legomena are particularly important because they are maximally efficient at driving the search needed to answer a query. While absolute hapax legomena are great for quickly pruning things within a document or document region, relative hapax legomena can do the same thing when searching among multiple documents or document regions. The two indexes I'm suggesting are designed to take advantage of this general property of text searching.

Hopefully this clarifies things and motivates a better discussion?

Ron
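The hapax legomena point can be illustrated with a toy corpus (my example, not from the thread): a token that occurs exactly once across the corpus pins down its document in a single step, which is what makes such terms maximally selective search drivers.

```python
# Hapax legomena as search drivers: tokens with corpus-wide count 1
# identify their document immediately, so they prune the search space
# faster than any common term can.
from collections import Counter

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "a quokka slept on the mat",
}

counts = Counter(tok for text in docs.values() for tok in text.split())
hapaxes = {tok for tok, n in counts.items() if n == 1}

def search(term):
    """Naive scan; a query containing a hapax term matches exactly one doc."""
    return [d for d, text in docs.items() if term in text.split()]

print(sorted(hapaxes))      # ['a', 'cat', 'dog', 'log', 'quokka', 'slept']
print(search("quokka"))     # ['d3'] -- one document, no further work needed
print(search("the"))        # ['d1', 'd2', 'd3'] -- a common term prunes nothing
```

"Relative" hapaxes work the same way one level up: a term unique within a document region prunes the scan inside that region even if it is common elsewhere.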
