Re: multi terabyte fulltext searching - Mailing list pgsql-general

From: Oleg Bartunov
Subject: Re: multi terabyte fulltext searching
Msg-id: Pine.LNX.4.64.0703211908400.12152@sn.sai.msu.ru
In response to: Re: multi terabyte fulltext searching (Benjamin Arai <benjamin@araisoft.com>)
Responses: Re: multi terabyte fulltext searching (Benjamin Arai <benjamin@araisoft.com>)
List: pgsql-general
On Wed, 21 Mar 2007, Benjamin Arai wrote:

> Hi Oleg,
>
> I am currently using GiST indexes because I receive about 10GB of new data
> a week (then again, I am not deleting any information).  We do not expect
> to stop receiving text for about 5 years, so the data is not going to
> become static any time soon.  The reason I am concerned with performance
> is that I am providing a search system for several newspapers going back
> essentially to the beginning of their archives.  Many bibliographers and
> other researchers would like to use this utility, but if each search takes
> too long I will not be able to support many concurrent users.
>

GiST is OK for your feed, but the archive part should use a GIN index.
Inheritance plus constraint exclusion (CE) should make your life easier.
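Here is a minimal sketch of that layout. Table and column names are
illustrative, not Benjamin's actual schema, and the functions are the
8.2-era tsearch2 ones:

  -- Parent table; children inherit its columns.
  CREATE TABLE articles (
      id        serial,
      published date NOT NULL,
      body      text,
      body_fts  tsvector
  );

  -- Hot feed partition: takes the weekly 10GB, so use GiST,
  -- which is cheap to update.
  CREATE TABLE articles_feed (
      CHECK (published >= DATE '2007-01-01')
  ) INHERITS (articles);
  CREATE INDEX articles_feed_fts_idx
      ON articles_feed USING gist (body_fts);

  -- Static archive partitions (one per year): use GIN,
  -- which is faster to search.
  CREATE TABLE articles_2006 (
      CHECK (published >= DATE '2006-01-01'
         AND published <  DATE '2007-01-01')
  ) INHERITS (articles);
  CREATE INDEX articles_2006_fts_idx
      ON articles_2006 USING gin (body_fts);

  -- With constraint exclusion on, a dated query scans only the
  -- partitions whose CHECK constraints can match, and the ranked
  -- results from all scanned partitions are merged by one ORDER BY.
  SET constraint_exclusion = on;
  SELECT id
    FROM articles
   WHERE body_fts @@ to_tsquery('press & archive')
     AND published >= DATE '2006-01-01'
   ORDER BY rank_cd(body_fts, to_tsquery('press & archive')) DESC
   LIMIT 20;

Since a yearly partition never changes once loaded, its GIN index can be
built in a single pass after the data leaves the feed, instead of being
maintained row by row under the insert load.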

> Benjamin
>
> On Mar 21, 2007, at 8:42 AM, Oleg Bartunov wrote:
>
>> Benjamin,
>>
>> as one of the authors of tsearch2, I'd like to know more about your
>> setup.  tsearch2 in 8.2 has GIN index support, which scales much better
>> than the old GiST index.
>>
>> Oleg
>>
>> On Wed, 21 Mar 2007, Benjamin Arai wrote:
>>
>>> Hi,
>>>
>>> I have been struggling to get fulltext searching working for very large
>>> databases.  I can fulltext index 10s of gigs without any problem, but
>>> when I start getting to hundreds of gigs it becomes slow.  My current
>>> system is a quad core with 8GB of memory.  I have the resources to throw
>>> more hardware at it, but realistically it is not cost effective to buy a
>>> system with 128GB of memory.  Are there any solutions that people have
>>> come up with for indexing very large text databases?
>>>
>>> Essentially I have several terabytes of text that I need to index.  Each
>>> record is about 5 paragraphs of text.  I am currently using TSearch2
>>> (stemming, etc.) and getting sub-optimal results: queries take more than
>>> a second to execute.  Has anybody implemented such a database using
>>> multiple systems or some special add-on to TSearch2 to make things
>>> faster?  I want to do something like partitioning the data across
>>> multiple systems and merging the ranked results at some master node.  Is
>>> something like this possible in PostgreSQL, or must it be a software
>>> solution?
>>>
>>> Benjamin
>>>
>
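Regarding the GIN support mentioned above: on an existing tsearch2
column, switching is only a matter of the index access method. A sketch
with hypothetical names (the trigger is the stock tsearch2 trigger
function):

  -- Hypothetical table "docs" with a tsearch2-maintained column:
  --   docs(body text, body_fts tsvector)
  UPDATE docs SET body_fts = to_tsvector(body);
  CREATE TRIGGER docs_fts_update
      BEFORE INSERT OR UPDATE ON docs
      FOR EACH ROW EXECUTE PROCEDURE tsearch2(body_fts, body);

  -- Swap the access method: drop the old GiST index, build GIN.
  DROP INDEX docs_fts_gist_idx;
  CREATE INDEX docs_fts_gin_idx ON docs USING gin (body_fts);

GIN trades slower updates for faster searches, which is why it fits the
static archive rather than the weekly feed.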

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
