On Tue, 29 Aug 2006, Tino Wildenhain wrote:
> Joshua D. Drake wrote:
> ...
>> Rolling our own really wouldn't be that hard "if" we can create a
>> reasonably smart web page grabber. We have all the tools (tsearch2 and
>> pg_trgm) to easily do the searches.
>>
>> So is anyone up for helping develop a page grabber?
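(The query side at least is easy. A minimal sketch, assuming a
hypothetical table pages(url, title, body) with a precomputed tsvector
column fti and the pg_trgm contrib module installed:

    -- full-text search with tsearch2
    SELECT url, title
    FROM pages
    WHERE fti @@ to_tsquery('default', 'grabber & search');

    -- fuzzy lookup of a misspelled word with pg_trgm, against a
    -- hypothetical one-column word list extracted from the documents
    SELECT word
    FROM words
    WHERE word % 'serach'
    ORDER BY similarity(word, 'serach') DESC
    LIMIT 10;

The hard part is getting clean text into pages.body in the first place.)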
>
> That's not the hardest part, but why do we need to grab at all if the
> contents of the pages could be in the database? Admittedly, I don't
> know any good CMS w/ a PostgreSQL backend. Anyway, grabbing the sources
> of the pages when they are published (like the docbook stuff
> for the documentation) makes a lot more sense imho. Ditto for the
> archives. It's much easier to get an idea of the structure and nature
> of the data when you don't have to deal with the final result (e.g. HTML).
>
> So a couple of scripts that fire when mail comes in, when documentation
> is compiled, and when any other publishing takes place could
> really help to keep the index in sync w/o having to crawl all sites
> over and over again.
This is exactly what we have on pgsql.ru/db/mw. We use procmail to fire
our backend to process each incoming message. That part is not a problem;
the most complex piece is the backend itself.
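For illustration, a minimal sketch of such a recipe (the path and the
script name index-message.pl are invented for the example; our real
setup differs in details):

    # .procmailrc: pipe every incoming list message to the indexing backend
    :0 w
    | /usr/local/bin/index-message.pl

The backend parses the message and ends up doing roughly:

    -- hypothetical schema; fti is a tsvector column with a GiST index
    INSERT INTO messages (msgid, subject, body, fti)
    VALUES ($1, $2, $3, to_tsvector('default', $2 || ' ' || $3));

so the index is updated as mail arrives, with no crawling involved.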
>
> Regards
> Tino Wildenhain
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83