Re: website doc search is extremely SLOW - Mailing list pgsql-general

From Oleg Bartunov
Subject Re: website doc search is extremely SLOW
Date
Msg-id Pine.GSO.4.58.0401031707160.11643@ra.sai.msu.su
In response to Re: website doc search is extremely SLOW  ("Marc G. Fournier" <scrappy@postgresql.org>)
Responses Re: website doc search is extremely SLOW  (Dave Cramer <pg@fastcrypt.com>)
Re: website doc search is extremely SLOW  ("Marc G. Fournier" <scrappy@postgresql.org>)
List pgsql-general
Hi there,

I hoped to release a pilot version of www.pgsql.ru with full-text search
of PostgreSQL-related resources (we've currently crawled 27 sites, about
340K pages), but we started celebrating the New Year too early :)
Expect it tomorrow or Monday.

We have developed many search engines. Some of them, like tsearch2 and
OpenFTS, are based on PostgreSQL and are best embedded into a CMS for
true online updating. Their power comes from access to document attributes
stored in the database, so one can perform categorized search or restricted
search (different rights, different document status, etc.). The closest
example would be search over a mailing list archive, which should
embed this kind of full-text search engine. fts.postgresql.org in its best
days was one implementation of such a system. This is what I hope to have on
www.pgsql.ru, if Marc gives us access to the mailing list archives :)
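
To illustrate the restricted search described above, here is a minimal
sketch in plain Python. It is not tsearch2's API: the Document fields,
the ROLE_RANK table and restricted_search() are all hypothetical names
chosen for the example. The point is only that text matching and
attribute checks (status, rights, category) are applied together because
both live next to each other in the same store.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: int
    text: str
    category: str
    status: str    # hypothetical values: "published" or "draft"
    min_role: str  # lowest role allowed to read the document


# Hypothetical access levels, lowest to highest.
ROLE_RANK = {"guest": 0, "member": 1, "admin": 2}


def restricted_search(docs, query, role, category=None):
    """Return ids of documents that contain every query term AND pass
    the attribute checks (published, readable by role, category)."""
    terms = query.lower().split()
    hits = []
    for doc in docs:
        words = set(doc.text.lower().split())
        if not all(term in words for term in terms):
            continue  # text does not match
        if doc.status != "published":
            continue  # wrong document status
        if ROLE_RANK[role] < ROLE_RANK[doc.min_role]:
            continue  # caller lacks the rights
        if category is not None and doc.category != category:
            continue  # outside the requested category
        hits.append(doc.doc_id)
    return hits
```

In a database-backed engine the same conditions would typically be
ordinary predicates next to the full-text match in a single query.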

The other search engines we use are based on the standard technology of
inverted indices; they are best suited for indexing semi-static collections
of documents. We have a full-fledged crawler, indexer and searcher. Online
update of inverted indices is a rather complex technical task, and I'm
not sure there are databases which have true online update. On www.pgsql.ru
we use GTSearch, a generic text search engine we developed for
vertical searches (for example, PostgreSQL-related resources). It has a
common set of features like phrase search, proximity ranking, site search,
morphology, stemming support, cached documents, spell checking, similar search,
etc.
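
For readers unfamiliar with inverted indices, the following is a toy
sketch in Python, not GTSearch code. The class and method names are
invented for the example, and tokenization is a plain whitespace split
(no morphology or stemming). It maps each term to the documents and
positions where it occurs, which is enough for AND queries and phrase
search. It also hints at why online update is awkward: changing one
document means touching the posting list of every term it contained.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy positional inverted index: term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_terms = {}  # doc_id -> set of terms, needed for removal

    def add(self, doc_id, text):
        # Updating a document means purging its old postings first.
        if doc_id in self.doc_terms:
            self.remove(doc_id)
        terms = text.lower().split()
        for pos, term in enumerate(terms):
            self.postings[term].setdefault(doc_id, []).append(pos)
        self.doc_terms[doc_id] = set(terms)

    def remove(self, doc_id):
        # Every posting list the document appears in has to be edited.
        for term in self.doc_terms.pop(doc_id, ()):
            del self.postings[term][doc_id]
            if not self.postings[term]:
                del self.postings[term]

    def search(self, query):
        """Ids of documents containing all query terms (AND query)."""
        result = None
        for term in query.lower().split():
            docs = set(self.postings.get(term, {}))
            result = docs if result is None else result & docs
        return result if result is not None else set()

    def phrase(self, query):
        """Ids of documents containing the terms at consecutive positions."""
        terms = query.lower().split()
        hits = set()
        for doc_id in self.search(query):
            for start in self.postings[terms[0]][doc_id]:
                if all(start + i in self.postings[t][doc_id]
                       for i, t in enumerate(terms)):
                    hits.add(doc_id)
                    break
        return hits
```

A production index keeps posting lists compressed and sorted on disk,
which is what makes in-place updates much harder than in this in-memory
version.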

I see several separate tasks:

* official documents (documentation mostly)

 I'm not sure whether there is some kind of CMS on www.postgresql.org, but
 if there is, the best way is to embed tsearch2 into the CMS. You'll have a
 fast, incremental search engine. There are many users of tsearch2, and I think
 embedding isn't a very difficult problem. I estimate there are at most
 10-20K pages of documentation, which is nothing for tsearch2.

* mailing lists archive

 The mailing list archive is constantly growing and also requires
 incremental updates, so tsearch2 is needed here too. Nice hardware
 like Marc has described would be more than enough. We have a moderate dual
 PIII 1GHz server and I hope it will be enough.

* postgresql related resources

  I think this task should be solved using the standard technique: crawler,
  indexer, searcher. Due to the limited number of sites, it's possible to
  keep indices more up to date than the major search engines do, for example
  by crawling once a week. This is what we currently have on pgsql.ru, because
  it doesn't require any permissions from or interaction with the sites' officials.
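
As a small illustration of the weekly recrawl idea in the last item, the
sketch below picks which sites are due for another crawl. It is an
assumption-laden example, not how pgsql.ru schedules its crawler:
sites_due(), RECRAWL_INTERVAL and the dictionary of last-crawl
timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed policy: revisit every site once a week.
RECRAWL_INTERVAL = timedelta(weeks=1)


def sites_due(last_crawled, now, interval=RECRAWL_INTERVAL):
    """Return sites to crawl: never-crawled sites first, then the
    sites whose last crawl is oldest.

    last_crawled maps site name -> datetime of last crawl, or None.
    """
    due = [site for site, ts in last_crawled.items()
           if ts is None or now - ts >= interval]
    # Sort key: (already crawled?, last crawl time); None sorts first.
    return sorted(due, key=lambda s: (last_crawled[s] is not None,
                                      last_crawled[s] or now))
```

With only a few dozen sites, a simple pass like this once a day is
enough to keep every site within the chosen interval.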


    Regards,
        Oleg


On Wed, 31 Dec 2003, Marc G. Fournier wrote:

> On Tue, 30 Dec 2003, Joshua D. Drake wrote:
>
> > Hello,
> >
> >  Why are we not using Tsearch2?
>
> Because nobody has built it yet?  Oleg's stuff is nice, but we want
> something that we can build into the existing web sites, not a standalone
> site ...
>
> I keep searching the web hoping someone has come up with a 'tsearch2'
> based search engine that does the spidering, but, unless its sitting right
> in front of my eyes and I'm not seeing it, I haven't found it yet :(
>
> Out of everything I've found so far, mnogosearch is one of the best ... I
> just wish I could figure out where the bottleneck for it was, since, from
> reading their docs, their method of storing the data doesn't appear to be
> particularly off.  I'm tempted to try their caching storage manager, and
> getting away from SQL totally, but I *really* want to showcase PostgreSQL
> on this :(
>
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
