Re: website doc search is extremely SLOW - Mailing list pgsql-general
From | Dave Cramer |
---|---|
Subject | Re: website doc search is extremely SLOW |
Date | |
Msg-id | 1073143578.1662.71.camel@localhost.localdomain |
In response to | Re: website doc search is extremely SLOW (Oleg Bartunov <oleg@sai.msu.su>) |
List | pgsql-general |
On Sat, 2004-01-03 at 09:49, Oleg Bartunov wrote:
> Hi there,
>
> I hoped to release pilot version of www.pgsql.ru with full text search
> of postgresql related resources (currently we've crawled 27 sites, about
> 340K pages) but we started celebration NY too early :)
> Expect it tomorrow or monday.

Fantastic!

>
> We have developed many search engines, some of them are based on
> PostgreSQL like tsearch2, OpenFTS and are best to be embedded into
> CMS for true online updating. Their power comes from access to document attributes
> stored in the database, so one could perform categorized search, restricted
> search (different rights, different document status, etc). The closest
> example would be search on an archive of mailing lists, which should
> embed such kind of full text search engine. fts.postgresql.org in its best
> time was one implementation of such a system. This is what I hope to have on
> www.pgsql.ru, if Marc will give us access to mailing list archives :)

I too would like access to the archives.

>
> Other search engines we use are based on standard technology of
> inverted indices, they are best suited for indexing of semi-static collections
> of documents. We have a full-fledged crawler, indexer and searcher. Online
> update of inverted indices is a rather complex technological task and I'm
> not sure there are databases which have true online update. On www.pgsql.ru
> we use GTSearch which is a generic text search engine we developed for
> vertical searches (for example, postgresql related resources). It has
> a common set of features like phrase search, proximity ranking, site search,
> morphology, stemming support, cached documents, spell checking, similar search
> etc.
>
> I see several separate tasks:
>
> * official documents (documentation mostly)
>
> I'm not sure if there is some kind of CMS on www.postgresql.org, but
> if it's there the best way is to embed tsearch2 into the CMS. You'll have
> a fast, incremental search engine. There are many users of tsearch2 and I think
> embedding isn't a very difficult problem. I estimate there are maximum
> 10-20K pages of documentation, nothing for tsearch2.

A content management system is long overdue I think, do you have any
good recommendations?

>
> * mailing lists archive
>
> mailing lists archive, which is constantly growing and
> also requires incremental update, so tsearch2 is also needed. Nice hardware
> like Marc has described would be more than enough. We have a moderate dual
> PIII 1GHz server and I hope it would be enough.
>
> * postgresql related resources
>
> I think this task should be solved using standard technique - crawler,
> indexer, searcher. Due to the limited number of sites it's possible to
> keep indices more up to date than major search engines, for example
> crawl once a week. This is what we currently have on pgsql.ru because
> it doesn't require any permissions and interaction with site officials.
>
>
> Regards,
> Oleg
>
>
> On Wed, 31 Dec 2003, Marc G. Fournier wrote:
>
> > On Tue, 30 Dec 2003, Joshua D. Drake wrote:
> >
> > > Hello,
> > >
> > > Why are we not using Tsearch2?
> >
> > Because nobody has built it yet? Oleg's stuff is nice, but we want
> > something that we can build into the existing web sites, not a standalone
> > site ...
> >
> > I keep searching the web hoping someone has come up with a 'tsearch2'
> > based search engine that does the spidering, but, unless it's sitting right
> > in front of my eyes and I'm not seeing it, I haven't found it yet :(
> >
> > Out of everything I've found so far, mnogosearch is one of the best ... I
> > just wish I could figure out where the bottleneck for it was, since, from
> > reading their docs, their method of storing the data doesn't appear to be
> > particularly off. I'm tempted to try their caching storage manager, and
> > getting away from SQL totally, but I *really* want to showcase PostgreSQL
> > on this :(
> >
> > ----
> > Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
> > Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 2: you can get off all lists at once with the unregister command
> > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
>
--
Dave Cramer
519 939 0336
ICQ # 1467551
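
Editor's aside on the "online update of inverted indices" point in the quoted message: the following is only a toy sketch in Python of what an inverted index with in-place document replacement looks like. It is not how tsearch2, OpenFTS, GTSearch or mnogosearch are implemented; the class name `InvertedIndex` and its methods are hypothetical, and it ignores stemming, ranking, persistence and concurrency, which is where the real difficulty of online updates lies.

```python
from collections import defaultdict
import re


class InvertedIndex:
    """Toy in-memory inverted index: term -> set of document ids (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}               # doc id -> terms indexed for that document

    @staticmethod
    def tokenize(text):
        # Lowercase and split on non-alphanumerics; no stemming or morphology.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, doc_id, text):
        # Re-adding an existing id replaces its old contents (an "online" update).
        self.remove(doc_id)
        terms = self.tokenize(text)
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        # Drop the document from every posting list it appears in.
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, query):
        # AND semantics: ids of documents that contain every query term.
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result


index = InvertedIndex()
index.add(1, "tsearch2 full text search")
index.add(2, "mnogosearch crawler and indexer")
index.add(1, "tsearch2 incremental update")  # replaces document 1
print(index.search("tsearch2 update"))
```

The `doc_terms` map is what makes replacement cheap here: without a record of which terms a document contributed, removing it would mean scanning every posting list.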