Re: [GENERAL] Creation of tsearch2 index is very - Mailing list pgsql-performance
From | Oleg Bartunov |
---|---|
Subject | Re: [GENERAL] Creation of tsearch2 index is very |
Date | |
Msg-id | Pine.GSO.4.63.0601211927550.14417@ra.sai.msu.su Whole thread Raw |
In response to | Re: [GENERAL] Creation of tsearch2 index is very slow (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-performance |
On Sat, 21 Jan 2006, Ron wrote: > Perhaps a different approach to this problem is called for: > > _Managing Gigabytes: Compressing and Indexing Documents and Images_ 2ed > Witten, Moffat, Bell > ISBN 1-55860-570-3 > > This is a VERY good book on the subject. > > I'd also suggest looking at the publicly available work on indexing and > searching for search engines like Inktomi (sp?) and Google. > Ron Ron, you completely miss the problem ! We do know MG and other SE. Actually, we've implemented several search engines based on inverted index technology (see, for example, pgsql.ru/db/pgsearch). tsearch2 was designed for online indexing, while keeping inverted index online is rather difficult problem. We do have plan to implement inverted index as an option for large read-only archives, but now we discuss how to organize online index and if possible to optimize current storage for signatures without breaking search performance. > > > At 08:34 AM 1/21/2006, Oleg Bartunov wrote: >> On Sat, 21 Jan 2006, Ron wrote: >> >>> At 07:23 PM 1/20/2006, Tom Lane wrote: >>>> "Steinar H. Gunderson" <sgunderson@bigfoot.com> writes: >>>> > On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote: >>>> >> It's also worth considering that the entire approach is a heuristic, >>>> >> really --- getting the furthest-apart pair of seeds doesn't guarantee >>>> >> an optimal split as far as I can see. Maybe there's some totally >>>> >> different way to do it. >>>> > For those of us who don't know what tsearch2/gist is trying to >>>> accomplish >>>> > here, could you provide some pointers? :-) >>>> Well, we're trying to split an index page that's gotten full into two >>>> index pages, preferably with approximately equal numbers of items in >>>> each new page (this isn't a hard requirement though). >>> >>> Maybe we are over thinking this. What happens if we do the obvious and >>> just make a new page and move the "last" n/2 items on the full page to the >>> new page? >>> >>> Various forms of "move the last n/2 items" can be tested here: >>> 0= just split the table in half. Sometimes KISS works. O(1). >>> 1= the one's with the highest (or lowest) "x" value. >>> 2= the one's with the highest sum of coordinates (x+y+...= values in the >>> top/bottom n/2 of entries). >>> 3= split the table so that each table has entries whose size_waste values >>> add up to approximately the same value. >>> 4= I'm sure there are others. >>> 1-5 can be done in O(n) time w/o auxiliary data. They can be done in O(1) >>> if we've kept track of the appropriate metric as we've built the current >>> page. >>> >>> >>>> I think the true figure of merit for a split is how often will subsequent >>>> searches have to descend into *both* of the resulting pages as opposed to >>>> just one >>>> --- the less often that is true, the better. I'm not very clear on what >>>> tsearch2 is doing with these bitmaps, but it looks like an upper page's >>>> downlink has the union (bitwise OR) of the one-bits in the values on the >>>> lower page, and you have to visit the lower page if this union has a >>>> nonempty intersection with the set you are looking for. If that's >>>> correct, what you really want is to divide the values so that the unions >>>> of the two sets have minimal overlap ... which seems to me to have little >>>> to do with what the code does at present. >>> I'm not sure what "upper page" and "lower page" mean here? >>> >>> >>>> Teodor, Oleg, can you clarify what's needed here? >>> Ditto. Guys what is the real motivation and purpose for this code? >> >> we want not just split the page by two very distinct parts, but to keep >> resulted signatures which is ORed signature of all signatures in the page >> as much 'sparse' as can. some information available here >> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals >> >> Unfortunately, we're rather busy right now and couldn't be very useful. > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
pgsql-performance by date: