Home > mailing lists

Re: [GENERAL] Creation of tsearch2 index is very - Mailing list pgsql-performance

From	Craig A. James
Subject	Re: [GENERAL] Creation of tsearch2 index is very
Date	January 26, 2006 22:33:58
Msg-id	43D98582.1050501@modgraph-usa.com Whole thread Raw
In response to	Re: [GENERAL] Creation of tsearch2 index is very (Ron <rjpeace@earthlink.net>)
Responses	Re: [GENERAL] Creation of tsearch2 index is very
List	pgsql-performance

Tree view

Ron <rjpeace@earthlink.net> writes:
> We have two problems here.
> The first is that the page splitting code for these indexes currently
> has O(N^2) performance.
> The second is that whatever solution we do use for this functionality,
> we still need good performance during searches that use the index.

No, unfortunately that's not the problem that needs to be solved.

The problem is figuring out WHICH records to put in the "left" and "right" trees once you split them.  If you can
figurethat out, then your suggestion (and perhaps other techniques) could be useful. 

The problem boils down to this:  You have a whole bunch of essentially random bitmaps.  You have two buckets.  You want
toput half of the bitmaps in one bucket, and half in the other bucket, and when you get through, you want all of the
bitmapsin each bucket to be maximally similar to each other, and maximally dissimilar to the ones in the other bucket. 

That way, when you OR all the bitmaps in each bucket together to build the bitmap for the left and right child nodes of
thetree, you'll get maximum separation -- the chances that you'll have to descend BOTH the left and right nodes of the
treeare minimized. 

Unfortunately, this problem is very likely in the set of NP-complete problems, i.e. like the famous "Traveling Salesman
Problem,"you can prove there's no algorithm that will give the answer in a reasonable time.  In this case, "reasonable"
wouldbe measured in milliseconds to seconds, but in fact an actual "perfect" split of a set of bitmaps probably can't
becomputed in the lifetime of the universe for more than a few hundred bitmaps. 

That's the problem that's being discussed: How do you decide which bitmaps go in each of the two buckets?  Any solution
willnecessarily be imperfect, a pragmatic algorithm that gives an imperfect, but acceptable, answer.   

As I mentioned earlier, chemists make extensive use of bitmaps to categorize and group molecules.  They use Tanimoto or
Tverskysimilarity metrics (Tanimoto is a special case of Tversky), because it's extremely fast to compare two bitmaps,
andthe score is highly correlated with the number of bits the two bitmaps have in common. 

But even with a fast "distance" metric like Tanimoto, there's still no easy way to decide which bucket to put each
bitmapinto. 

Craig

pgsql-performance by date:

From: Ron
Date: 26 January 2006, 21:55:53
Subject: Re: [GENERAL] Creation of tsearch2 index is very

From: Ron
Date: 26 January 2006, 23:33:11
Subject: Re: [GENERAL] Creation of tsearch2 index is very

Re: [GENERAL] Creation of tsearch2 index is very - Mailing list pgsql-performance

Previous

Next