> -----Original Message-----
> From: Oleg Bartunov [mailto:oleg@sai.msu.su]
> Sent: 14 July 2006 13:48
> To: Dave Page
> Cc: Magnus Hagander; pgsql-www@postgresql.org
> Subject: RE: [pgsql-www] Web team meeting minutes
>
> I just wanted to say that the current search is not designed for
> Web site indexing.
Err, from the site:
ASPseek is an Internet search engine software developed by SWsoft and
licensed as free software under the GNU GPL.
ASPseek consists of an indexing robot, a search daemon, and a CGI search
frontend. It can index as many as a few million URLs and search for
words and phrases, use wildcards, and do Boolean searches. Search
results can be limited to a given time period, site or Web space (set of
sites) and sorted by relevance (PageRank is used) or date.
> Search, for example, for the latest news title "Open Technology
> Group, Inc. announces plPHP training" and you'll get nothing! It
> will not be searchable until a new index is built. This is exactly
> why we developed tsearch2 - online indexing. If the documents are in
> the database, then the only requirement is to set up tsearch2; if
> not, then you need something like OpenFTS.
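(For anyone following along, the online indexing Oleg describes boils
down to something like the following with the tsearch2 contrib module.
This is only a sketch - the table and column names are hypothetical,
not anything from our site:)

```sql
-- Hypothetical news table; names are illustrative only.
CREATE TABLE news (
    id      serial PRIMARY KEY,
    title   text,
    body    text,
    idxfti  tsvector          -- tsearch2 full-text index column
);

-- Index the existing rows ('default' is the stock tsearch2 configuration).
UPDATE news SET idxfti = to_tsvector('default', title || ' ' || body);

-- GiST index so searches stay fast.
CREATE INDEX news_fti_idx ON news USING gist (idxfti);

-- The trigger keeps idxfti current on every INSERT/UPDATE -- this is
-- the "online indexing": a new document is searchable immediately.
CREATE TRIGGER news_fti_update BEFORE INSERT OR UPDATE ON news
    FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxfti, title, body);

-- Searching, e.g. for the news item Oleg mentions:
SELECT title FROM news WHERE idxfti @@ to_tsquery('plPHP & training');
```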
Actually our port of ASPseek can do online indexing - John added an XML
feed through which you can insert index data directly (he used to use it
to accept catalogue feeds from online resellers, IIRC). The problem is
that we don't have any way to stream the data off the website in that
way, so we still end up crawling anyway.
I do appreciate your point though, and if anyone can come up with a way
to stream data from the website (perhaps just as part of the static
build process) then it might be worth looking at. The archives would
have the same problem, I guess - whilst it would be easy enough to
index mail messages online, you have no way of knowing what the URL on
archives.postgresql.org would be at that point, unless we fundamentally
redesigned the entire archives site to run from the database.
Regards, Dave.