Re: website doc search is extremely SLOW - Mailing list pgsql-general

From Dave Cramer
Subject Re: website doc search is extremely SLOW
Date
Msg-id 1073143578.1662.71.camel@localhost.localdomain
In response to Re: website doc search is extremely SLOW  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-general
On Sat, 2004-01-03 at 09:49, Oleg Bartunov wrote:
> Hi there,
>
> I hoped to release a pilot version of www.pgsql.ru with full text search
> of PostgreSQL-related resources (currently we've crawled 27 sites, about
> 340K pages) but we started celebrating New Year too early :)
> Expect it tomorrow or Monday.
Fantastic!
>
> We have developed many search engines. Some of them, like tsearch2 and
> OpenFTS, are based on PostgreSQL and are best embedded into a
> CMS for true online updating. Their power comes from access to document attributes
> stored in the database, so one can perform categorized search and restricted
> search (different rights, different document status, etc). The closest
> example would be search on an archive of mailing lists, which should
> embed this kind of full text search engine. fts.postgresql.org in its best
> days was one implementation of such a system. This is what I hope to have on
> www.pgsql.ru, if Marc will give us access to mailing list archives :)

I too would like access to the archives.

>
> The other search engines we use are based on the standard technology of
> inverted indices; they are best suited for indexing semi-static collections
> of documents. We have a full-fledged crawler, indexer and searcher. Online
> update of inverted indices is a rather complex technological task and I'm
> not sure there are databases which have true online update. On www.pgsql.ru
> we use GTSearch, a generic text search engine we developed for
> vertical searches (for example, PostgreSQL-related resources). It has a
> common set of features like phrase search, proximity ranking, site search,
> morphology, stemming support, cached documents, spell checking, similar search
> etc.
>
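For readers unfamiliar with the term, the inverted-index idea mentioned above can be sketched roughly as follows. This is only a minimal in-memory illustration in Python, not how GTSearch or tsearch2 is implemented; the function names (tokenize, build_index, search) and the sample documents are hypothetical and chosen just for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Start from the posting set of the first token, then intersect the rest.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample collection, for illustration only.
docs = {
    1: "PostgreSQL full text search with tsearch2",
    2: "Mailing list archive for pgsql-general",
    3: "Full text search of the mailing list archive",
}
index = build_index(docs)
```

Calling search(index, "mailing list") looks up each word's posting set and intersects them. Adding a document is cheap here because everything lives in memory; a real on-disk index is much harder to update in place, which may be part of the complexity the paragraph above refers to.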
> I see several separate tasks:
>
> * official documents (documentation mostly)
>
>  I'm not sure whether there is some kind of CMS on www.postgresql.org, but
>  if there is, the best way is to embed tsearch2 into the CMS. You'll have a
>  fast, incremental search engine. There are many users of tsearch2 and I think
>  embedding isn't a very difficult problem. I estimate there are at most
>  10-20K pages of documentation, nothing for tsearch2.

A content management system is long overdue, I think. Do you have any
good recommendations?

>
> * mailing lists archive
>
>  The mailing list archive is constantly growing and
>  also requires incremental update, so tsearch2 is needed here too. Nice hardware
>  like Marc has described would be more than enough. We have a moderate dual
>  PIII 1GHz server and I hope it would be enough.
>
> * postgresql related resources
>
>   I think this task should be solved using the standard technique: crawler,
>   indexer, searcher. Due to the limited number of sites it's possible to
>   keep indices more current than the major search engines do, for example
>   by crawling once a week. This is what we currently have on pgsql.ru because
>   it doesn't require any permissions or interaction with site officials.
>
>
>     Regards,
>         Oleg
>
>
> On Wed, 31 Dec 2003, Marc G. Fournier wrote:
>
> > On Tue, 30 Dec 2003, Joshua D. Drake wrote:
> >
> > > Hello,
> > >
> > >  Why are we not using Tsearch2?
> >
> > Because nobody has built it yet?  Oleg's stuff is nice, but we want
> > something that we can build into the existing web sites, not a standalone
> > site ...
> >
> > I keep searching the web hoping someone has come up with a 'tsearch2'
> > based search engine that does the spidering, but, unless it's sitting right
> > in front of my eyes and I'm not seeing it, I haven't found it yet :(
> >
> > Out of everything I've found so far, mnogosearch is one of the best ... I
> > just wish I could figure out where the bottleneck for it was, since, from
> > reading their docs, their method of storing the data doesn't appear to be
> > particularly off.  I'm tempted to try their caching storage manager, and
> > getting away from SQL totally, but I *really* want to showcase PostgreSQL
> > on this :(
> >
> > ----
> > Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> > Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
> >
> >
>
>     Regards,
>         Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
>
--
Dave Cramer
519 939 0336
ICQ # 1467551

