Re: A counter productive conversation about search. - Mailing list pgsql-www
From | Oleg Bartunov |
---|---|
Subject | Re: A counter productive conversation about search. |
Date | |
Msg-id | Pine.GSO.4.63.0608290857420.16344@ra.sai.msu.su |
In response to | A counter productive conversation about search. ("Joshua D. Drake" <jd@commandprompt.com>) |
Responses | Re: A counter productive conversation about search. |
 | Re: A counter productive conversation about search. |
List | pgsql-www |
Hi there,

On Mon, 28 Aug 2006, Joshua D. Drake wrote:

> I have on multiple occasions brought up the idea of another search engine. I
> wrote the pgsql.ru guys and asked if they would share their code. To their
> benefit they said they would be willing but didn't have the time to install
> it for us. I told them I would be happy to muscle through it if they would
> just answer some emails. I never heard back.

Joshua, we'd be happy to help the PostgreSQL community, and we did try in the
past with pgsql.ru, but we have families and we are in a situation where we
need money to live. We don't want to promise something we might not be able
to keep.

On pgsql.ru we have two search engines. One is a commercial product that
crawls pages, indexes them and provides search. Teodor and I are not its only
owners, so there is a problem with using it. I also don't like the idea of
using it, since its indexing is not fully online.

The second search engine, based on tsearch2, is what we actually need. Several
years ago (fts.postgresql.org) tsearch2 was slow, but now that we have GiN
support I see no real problem with fully online indexing. We plan to renew
pgsql.ru after the 8.2 release, and then we'll see how it works.

Another question is how documents get indexed. We have a special user, called
"robot", which is subscribed to almost all of the mailing lists, and a
procmail entry that processes each incoming message through our CMS. This has
worked nicely and keeps the index fully in sync. Of course, we depend on which
messages reach the robot; that is not a problem for archives.postgresql.org,
which has full control over the mailing lists.

To index www.postgresql.org I see two alternatives:

1. Periodically run a script that crawls the site.
2. Have a real CMS with a hook into the indexer.

I suspect the second way is too complex for the current state of affairs, so
I'd stay with the first one. Given that the documentation changes slowly and
only the news pages require indexing, it's not a bad approximation.

Hmm, looks like a mess :( The entire system needs to be rewritten! In my
opinion, without an understanding of what to index and search, and without
financial support, the current thread is useless. Do we have any financing
for that?

Regards,
Oleg

btw, we have a simple crawler for OpenFTS, available from
http://openfts.sourceforge.net/contributions.shtml
Using it, it's possible to write a simple script to index collections of
documents, such as the documentation. See the examples at
http://mira.sai.msu.su/~megera/pgsql

> Other options include Lucene, and rolling our own.
>
> Rolling our own really wouldn't be that hard "if" we can create a reasonably
> smart web page grabber. We have all the tools (tsearch2 and pg_trgm) to
> easily do the searches.
>
> So is anyone up for helping develop a page grabber?
>
> Sincerely,
>
> Joshua D. Drake

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
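[Editor's note] For readers unfamiliar with the pipeline Oleg describes (a "robot" account whose procmail rule pipes every incoming list message into an indexer backed by a GIN-indexed tsvector column), here is a minimal sketch of what such a delivery script might look like. It is an illustration, not the pgsql.ru code: the `messages` table, column names, connection string, and the use of psycopg2 are assumptions, and it targets a modern PostgreSQL where tsearch2's functionality (`to_tsvector`, GIN indexes) is built in.

```python
#!/usr/bin/env python3
"""Hypothetical sketch (not the pgsql.ru code): a delivery script that the
"robot" account's procmail rule pipes each incoming list message into, so the
message becomes searchable immediately -- the "fully online" indexing Oleg
describes. Table/column names and the DSN are illustrative assumptions."""

import email
import sys

import psycopg2

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id      serial PRIMARY KEY,
    msgid   text UNIQUE,
    subject text,
    body    text,
    fts     tsvector
);
CREATE INDEX IF NOT EXISTS messages_fts_idx ON messages USING gin (fts);
"""


def index_message(raw: bytes, dsn: str = "dbname=archives") -> None:
    msg = email.message_from_bytes(raw)
    subject = msg.get("Subject", "")
    msgid = msg.get("Message-ID", "")

    # Take the first text/plain part; real code would handle MIME and
    # character sets much more carefully.
    body = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            body = (part.get_payload(decode=True) or b"").decode("utf-8", "replace")
            break

    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SCHEMA)
        cur.execute(
            """INSERT INTO messages (msgid, subject, body, fts)
               VALUES (%s, %s, %s, to_tsvector('english', %s || ' ' || %s))
               ON CONFLICT (msgid) DO NOTHING""",
            (msgid, subject, body, subject, body),
        )


if __name__ == "__main__":
    index_message(sys.stdin.buffer.read())
```

A procmail recipe in the robot account would then simply pipe each delivered message into this script, so new mail is searchable as soon as it arrives, which is the "fully in sync" property Oleg mentions.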
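[Editor's note] Likewise, the first of Oleg's two alternatives for www.postgresql.org (a periodically run crawl script, e.g. from cron) could be sketched as below. The seed URL list and the `pages` table are hypothetical, and a production crawler would also follow links, respect robots.txt, and skip unchanged pages.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of alternative 1: a cron-driven script that fetches a
fixed set of pages and (re)indexes their visible text into a GIN-indexed
tsvector table. URLs, table and column names are illustrative assumptions."""

import re
from html.parser import HTMLParser
from urllib.request import urlopen

import psycopg2

SEED_URLS = [
    "https://www.postgresql.org/about/news/",
    "https://www.postgresql.org/docs/current/index.html",
]


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def crawl(dsn: str = "dbname=archives") -> None:
    # Assumes a table: pages(url text UNIQUE, body text, fts tsvector)
    # with CREATE INDEX ... USING gin (fts), analogous to the messages table.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for url in SEED_URLS:
            html = urlopen(url).read().decode("utf-8", "replace")
            extractor = TextExtractor()
            extractor.feed(html)
            text = re.sub(r"\s+", " ", " ".join(extractor.chunks)).strip()
            cur.execute(
                """INSERT INTO pages (url, body, fts)
                   VALUES (%s, %s, to_tsvector('english', %s))
                   ON CONFLICT (url) DO UPDATE
                       SET body = EXCLUDED.body, fts = EXCLUDED.fts""",
                (url, text, text),
            )


if __name__ == "__main__":
    crawl()
```

A search over such a table is then a single SQL query, e.g. `SELECT url FROM pages WHERE fts @@ plainto_tsquery('english', 'full text search');`.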