Re: [GENERAL] Creation of tsearch2 index is very - Mailing list pgsql-performance

From Ron
Subject Re: [GENERAL] Creation of tsearch2 index is very
Whole thread Raw
In response to Re: [GENERAL] Creation of tsearch2 index is very slow  (Martijn van Oosterhout)
List pgsql-performance
At 11:13 PM 1/26/2006, Craig A. James wrote:
>I'll write to you privately, because these discussions can get messy
>"in public".

I'm responding to this missive publicly in an attempt to help the
discussion along.  It is not my usual practice to respond to private
messages publicly, but this seems a good case for an exception.

>>You seem to have missed my point.  I just gave a very clear
>>description of how to "decide which bitmaps go in each of the two
>>buckets" by reformulating the question into "decide which bitmaps
>>go in each of =four= buckets".
>Sorry to disagree, but here's the problem. It's not whether you put
>them into two, or four, or N buckets.  The problem is, how do you
>categorize them to begin with, so that you have some reasonable
>criterion for which item goes in which bucket?  THAT is the hard
>problem, not whether you have two or four buckets.

Agreed.  ...and I've given the answer to "how do you categorize them"
using a general property of RD Trees which should result in "a
reasonable criterion for which item goes in which bucket" when used
for text searching.

The definition of RD tree keys being either "atomic" (non
decomposable) or "molecular" (containing the keys of their
descendents) is the one source of our current problems figuring out
how to split them and, if I'm correct, a hint as to how to solve the
splitting problem in O(1) time while helping to foster high
performance during seearches.

>Earlier, you wrote:
>>Traditional B-tree ordering methods can be used to define the
>>ordering/placement of pages within each index, which will minimize
>>head seeks to find the correct page to scan.
>>Since the criteria for putting a key within a page or starting a
>>new page is simple, performance for those tasks should be O(1).
>What are the "traditional B-tree ordering methods"?  That's the
>issue, right there.  The problem is that bitmaps HAVE NO ORDERING
>METHOD.  You can't sort them numerically or alphabetically.
The =bitmaps= have no ordering method.  =Pages= of bitmaps MUST have
an ordering method or we have no idea which page to look at when
searching for a key.

Treating the "root" bitmaps (those that may have descendents but have
no parents) as numbers and ordering the pages using B tree creation
methods that use those numbers as keys is a simple way to create a
balanced data structure with high fan out.  IOW, a recipe for finding
the proper page(s) to scan using minimal seeks.

>Take a crowd of people.  It's easy to divide them in half by names:
>just alphabetize and split the list in half.  That's what your
>solution would improve on.
>But imagine I gave you a group of people and told you to put them
>into two rooms, such that when you are through, the people in each
>room are maximally similar to each other and maximally dissimilar to
>the people in the other room.  How would you do it?
>First of all, you'd have to define what "different" and "same"
>are.  Is your measure based on skin color, hair color, age, height,
>weight, girth, intelligence, speed, endurance, body odor, ...
>?  Suppose I tell you, "All of those".  You have to sort the people
>so that your two groups are separated such that the group
>differences are maximized in this N-dimensional
>space.  Computationally, it's a nearly impossible problem.
I'm =changing= the problem using the semantics of RD trees.  Using an
RD tree representation, we'd create and assign a key for each person
that ranked them compared to everyone else for each of the metrics we
decided to differentiate on.

Then we start to form trees of keys to these people by creating
"parent" keys as roots that contain the union of everyone with the
same or possibly similar value for some quality.  By iterating this
process, we end up with a bottom up construction method for an RD
tree whose root key will the union of all the keys representing these
people.  If this is one data structure, we end up with an efficient
and effective way of answering not only Boolean but also ranking and
similarity type queries.

The problem comes when we have to split this monolithic DS into
pieces for best performance.  As many have noted, there is no
simple  way to decide how to do such a thing.

OTOH, we =do= have the knowledge of how RD trees are built and what
their keys represent, and we know that queries are going to tend
strongly to either a) traverse the path from parent to child (depth
first) or b) find all siblings with (dis)similar characteristics
(breadth first), or c) contain a mixture of a) and b).

So I'm suggesting that conceptually we clone the original RD tree and
we split each clone according to two different methods.
Method A results in pages that contain as many complete depth first
paths from root to leave as possible on each page.
Method B results in pages that contain as many siblings as possible per page.
...and we use the appropriate index during each type of query or query part.

In return for using 2x as much space, we should have a general method
that is O(1) for decomposing RD trees in such a way as to support
high performance during searches.

>Here's a much simpler version of the problem.  Suppose I give you
>1000 numbers, and tell you to divide them in half so that the two
>groups have the smallest standard deviation within each group
>possible, and the two average values of the groups of numbers are
>the farthest apart possible.  Pretty easy, right?
>Now do it where "distance" is evaluated modulo(10), that is, 1 and 9
>are closer together than 1 and 3.  Now try it -- you'll find that
>this problem is almost impossible to solve.
This is orthogonal to the discussion at hand since the above is not
akin to text searching nor best done with RD trees and that is
exactly what this discussion is about.  We don't have to solve a
general problem for all domains.  We only have to solve it for the
specific domain of text search using the specific DS of RD trees.

>The problem is that, as database designers, we're used to text and
>numbers, which have an inherent order.  Bitmaps have no ordering --
>you can't say one is "greater than" or "less than" the other.  All
>you can do is determine that A and B are "more similar" or "less
>similar" than A and C.
Text and numbers only have an order because we deem them to.  There
is even a very common default order we tend to use.  But even in the
trivial case of ranking we've all said that "1" is better than "2" in
one situation and "2" is better than "1" in another.

In the case of text searching, the bitmaps represent where to find
legomena, AKA specific tokens.  Particularly hapax legomena, AKA
unique tokens.  Hapax Legomena are particularly important because
they are maximally efficient at driving the search needed to answer a query.

While absolute hapax legomena are great for quickly pruning things
within a document or document region, relative hapax legomena can do
the same thing when searching among multiple documents or document regions.

The two indexes I'm suggesting are designed to take advantage of this
general property of text searching.

Hopefully this clarifies things and motivates a better discussion?

pgsql-performance by date:

From: Ron
Subject: Re: [GENERAL] Creation of tsearch2 index is very
From: "Mike Biamonte"
Subject: Huge Data sets, simple queries