Re: A counter productive conversation about search. - Mailing list pgsql-www

From Oleg Bartunov
Subject Re: A counter productive conversation about search.
Date
Msg-id Pine.GSO.4.63.0608290857420.16344@ra.sai.msu.su
In response to A counter productive conversation about search.  ("Joshua D. Drake" <jd@commandprompt.com>)
Responses Re: A counter productive conversation about
Re: A counter productive conversation about search.
List pgsql-www
Hi there,

On Mon, 28 Aug 2006, Joshua D. Drake wrote:

>
> I have on multiple occasions brought up the idea of another search engine. I
> wrote the pgsql.ru guys and asked if they would share their code. To their
> benefit they said they would be willing but didn't have the time to install
> it for us. I told them I would be happy to muscle through it if they would
> just answer some emails. I never heard back.

Joshua, we'd be happy to help the PostgreSQL community, and in fact we tried
to in the past by developing pgsql.ru, but we have families and we need
money to live.  We don't want to make a promise we might break.
On pgsql.ru we have 2 search engines. One is a commercial version which
crawls pages, indexes them and provides search. Teodor and I are not its only
owners, so there is a problem with sharing it. Also, I don't like the idea of
using it, since its indexing is not fully online. The second SE, based on
tsearch2, is what we actually needed. Several years ago (fts.postgresql.org)
tsearch2 was slow, but now that we have GIN support I see no
real problem with fully online indexing. We plan to renew pgsql.ru after
8.2 is released, and then we'll see how it works.

Another problem is how documents get indexed. We have a special user,
called robot, which is subscribed to almost all the mailing lists, and a
procmail entry that hands each incoming message to our CMS for processing.
This works nicely and lets us stay fully in sync. Of course, we depend on
which messages reach the robot. This is not a problem for
archives.postgresql.org, which has full control over the mailing lists.
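
As an illustration of the robot approach described above, here is a minimal
sketch in Python of the step that turns one incoming message into a record
an indexer could store. It is not the pgsql.ru code: the function name
extract_document and the field names in the returned dictionary are
assumptions made for this example, and only the Python standard library is
used.

```python
from email import message_from_string, policy


def extract_document(raw_message: str) -> dict:
    """Parse one raw mail message into a flat record for an indexer.

    Hypothetical helper: the field names below are illustrative only.
    """
    msg = message_from_string(raw_message, policy=policy.default)

    # Prefer the plain-text part; messages without one get an empty body.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    return {
        "message_id": str(msg.get("Message-ID", "")).strip(),
        "sender": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        "list_id": str(msg.get("List-Id", "")),
        "body": body.strip(),
    }
```

A mail filter entry could pipe each delivered message to a small script that
calls extract_document(sys.stdin.read()) and passes the result to the
indexer. Keying the stored record on message_id would let a redelivered
message overwrite its earlier copy instead of being indexed twice.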

To index www.postgresql.org I see two alternatives:
1. Periodically run a script which crawls the site.
2. Have a real CMS with a hook to the indexer.

I suspect that the second way is too complex for the current state of things,
so I'd stay with the first one. Given that the documentation changes slowly and
only the news pages need regular reindexing, it's not a bad approximation.
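
The first alternative, a script that periodically crawls the site, could be
sketched roughly as follows. This is only an illustration in Python using
the standard library, not the OpenFTS crawler or any existing code; the
names LinkParser, fetch_http and crawl, and the max_pages limit, are
assumptions made for the example. The sketch does not check content types
or robots.txt, which a real crawler would need to do.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url):
    """Download a page and return its text (hypothetical default fetcher)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_http, max_pages=100):
    """Breadth-first crawl limited to the host of start_url.

    Returns a dict mapping each fetched URL to its HTML text.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Run from cron, the returned pages could be compared with the previously
stored copies so that only changed pages, for example the news pages, are
handed to the indexer again.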

Hmm, looks like a mess :( The entire system needs to be rewritten!

It's my opinion that without an understanding of what to index and search,
and without financial support, the current thread is useless. Do we have any
financing for that?


Regards,
     Oleg

btw, we have a simple crawler for OpenFTS, available from
http://openfts.sourceforge.net/contributions.shtml
Using it, it's possible to write a simple script to index collections of
documents, such as the documentation. See examples at
http://mira.sai.msu.su/~megera/pgsql


>
> Other options include lucene, and rolling our own.
>
> Rolling our own really wouldn't be that hard "if" we can create a reasonably
> smart web page grabber. We have all the tools (tsearch2 and pg_trgm) to
> easily do the searches.
>
> So is anyone up for helping develop a page grabber?
>
> Sincerely,
>
> Joshua D. Drake

_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
