Re: OK, does anyone have any better ideas? - Mailing list pgsql-hackers

From Oleg Bartunov
Subject Re: OK, does anyone have any better ideas?
Date
Msg-id Pine.GSO.3.96.SK.1001209163702.4174p-100000@ra
Whole thread Raw
In response to Re: OK, does anyone have any better ideas?  (mlw <markw@mohawksoft.com>)
List pgsql-hackers
On Sat, 9 Dec 2000, mlw wrote:

> Date: Sat, 09 Dec 2000 08:18:10 -0500
> From: mlw <markw@mohawksoft.com>
> To: Oleg Bartunov <oleg@sai.msu.su>
> Cc: Tom Lane <tgl@sss.pgh.pa.us>,
>     Hackers List <pgsql-hackers@postgresql.org>
> Subject: Re: [HACKERS] OK, does anyone have any better ideas?
> 
> Oleg Bartunov wrote:
> > 
> > We need multi-key B-tree like index for such problem.
> > Our full text search engine is based on arrays and we need to find quickly
> > is some number exists in array - some kind of index over int array.
> > We're currently testing GiST approach and seems will have some conclusions
> > soon. I think multi-key B-tree like index would be better in my
> > opinion, but this requires to much work and knowledge of postgres's internals.
> > Yesterday I read about UBTree, seems like it's good for index and query
> > sets. Currently postgres has no set specific methods.
> 
> The way I do my search indexing is with bitmap objects and a word
> dictionary. One creates a searchable dictionary of words by scanning the
> selected records. So, in one query that results in 2 million records,
> the total aggregate number of words is about 60,000 depending on how you
> parse. For each word, you create a "bitmap object" (in one of a few
> forms) where bit '0' represents the first record, bit '1' represents the
> second, and so on, until you have 2 million bits.
> 
> Set the correct bit in the bitmap for each document that contains that
> word. In the end you will have the equivalent 60,000 bitmaps or 2
> million bits.
> 
> During search time, one creates an empty bitmap of 2 million bits as a
> work space. One parses the search term, and performs boolean operation
> on the workspace from the bitmap retrieved for each word.
> 
> When you are done parsing, you have a bitmap with a bit set for each
> document that fits the search criteria. You then enumerate the bits by
> bit position, and you now have a list of document numbers.
> 
> If only I could get the list of document numbers back into
> postgres....... It would be great.

Gotcha. It's impossible to return a set from a function, so the only
way to use perl to parse your bitmap. We did (in one project) external
search using suffix arrays which incredibly fast and use postgres to 
return results to perl for processing.

Regards,
Oleg

> -- 
> http://www.mohawksoft.com
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



pgsql-hackers by date:

Previous
From: "Hiroshi Inoue"
Date:
Subject: RE: Japan pictures
Next
From: mlw
Date:
Subject: Re: OK, does anyone have any better ideas?