Re: web archiving - Mailing list pgsql-novice

From Ron Johnson
Subject Re: web archiving
Date
Msg-id 1026414819.12660.36.camel@rebel
In response to Re: web archiving  (Matt Price <matt.price@utoronto.ca>)
List pgsql-novice
On Thu, 2002-07-11 at 12:52, Matt Price wrote:
> Hi Philip, et al,
>
> well, wget is nice, and htdig/mnogosearch both seem great; but I want to
> be able to enter extra data about the web pages (author names, comments,
> subject/keyword entries...) so that the database starts to resemble a
> bibliographic database.  That is, I want other people to be able to take
> advantage of work that I and other data-entry slaves do when we enter
> the URLs.
>
> does that seem silly?

What you could do is use wget to store the html "trees" in individual
directories.  Then you can store the web "metadata" plus location in
the database.  That would minimize the size of the db, plus store the
html in its "natural habitat": the file system, where it's available
to Apache/etc.
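
A minimal sketch of that layout, not a finished tool.  Assumptions: the
table and column names (page, local_dir, ...) and the helper functions are
made up for illustration, and SQLite from the Python standard library
stands in for PostgreSQL so the sketch runs without a server.  The wget
options shown are standard ones (--page-requisites, --convert-links,
--directory-prefix).

```python
import sqlite3
import subprocess
from pathlib import Path

# Hypothetical schema: only the metadata and the on-disk location live in
# the database; the HTML itself stays in the file system.
SCHEMA = """
CREATE TABLE IF NOT EXISTS page (
    id        INTEGER PRIMARY KEY,
    url       TEXT NOT NULL,
    author    TEXT,
    keywords  TEXT,
    comments  TEXT,
    local_dir TEXT
);
"""


def archive_dir_for(root: Path, page_id: int) -> Path:
    """One directory per archived page, named after its database id."""
    return root / f"{page_id:06d}"


def wget_command(url: str, dest: Path) -> list[str]:
    """Build (but do not run) a wget call for the page and its images/CSS."""
    return [
        "wget",
        "--page-requisites",            # also fetch images, stylesheets, etc.
        "--convert-links",              # make the local copy browsable offline
        f"--directory-prefix={dest}",   # store everything under dest
        url,
    ]


def add_page(conn, root, url, author=None, keywords=None, comments=None,
             fetch=False):
    """Record the metadata, assign a directory, and optionally run wget."""
    cur = conn.execute(
        "INSERT INTO page (url, author, keywords, comments) VALUES (?, ?, ?, ?)",
        (url, author, keywords, comments),
    )
    page_id = cur.lastrowid
    dest = archive_dir_for(Path(root), page_id)
    conn.execute("UPDATE page SET local_dir = ? WHERE id = ?",
                 (str(dest), page_id))
    conn.commit()
    if fetch:
        dest.mkdir(parents=True, exist_ok=True)
        subprocess.run(wget_command(url, dest), check=True)
    return page_id


def find_by_keyword(conn, word):
    """Return (url, local_dir) for pages whose keywords contain the word."""
    rows = conn.execute(
        "SELECT url, local_dir FROM page WHERE keywords LIKE ? ORDER BY id",
        (f"%{word}%",),
    )
    return rows.fetchall()
```

With fetch=True, add_page() creates the directory and runs wget into it;
with the default fetch=False it only records the row, which is handy for
trying out the schema.  Pointing Apache at the archive root would then
serve the stored pages directly, and the database only has to answer the
"who wrote it / what is it about / where is it" questions.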

> On Wed, 2002-07-10 at 18:21, Philip Hallstrom wrote:
> > Not to discourage you from using postgresql or writing it yourself, but
> > you might want to take a look at wget (for downloading the web pages) and
> > mnogosearch or htdig for searching them.
> >
> > mnogosearch supports postgresql and has a PHP interface so you can have fun
> > with that...
> >
> > On 10 Jul 2002, Matt Price wrote:
> >
> > > Hi there,
> > >
> > > I've just moved up from non-free os's to debian linux, and installed
> > > postgresql, with the hope of getting started on some projects I've been
> > > thinking about.  Several of these projects involve web archives.  The
> > > idea is, a url is entered with a bunch of bibliographic-type data in
> > > other fields (keywords, author, date, etc).  The html (and hopefully,
> > > accompanying images/css's/etc) are then grabbed using curl, and archived
> > > in a postgresql database.  A web or other gui interface then provides
> > > fully-searchable access to the archive for later use.
> > >
> > > So my question:  does anyone know of a similar tool which already
> > > exists?  I'm a complete novice at database programming (and at php, too,
> > > which is what I figured I'd use as the scripting language, though I'd
> > > consider learning perl or java if folks think that's a much better
> > > idea), and I'd rather work with some pre-existing code than start from
> > > the ground up.  Any suggestions?  Is this the right list to be asking
> > > this question on?
> > >
> > > Thanks loads,
> > > Matt

--
+-----------------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net             |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81         |
|                                                                 |
| "Experience should teach us to be most on our guard to protect  |
|  liberty when the government's purposes are beneficent. Men     |
|  born to freedom are naturally alert to repel invasion of their |
|  liberty by evil minded rulers. The greatest dangers to liberty |
|  lurk in insidious encroachment by men of zeal, well-meaning    |
|  but without understanding."                                    |
|   Justice Louis Brandeis, dissenting, Olmstead v US (1928)      |
+-----------------------------------------------------------------+

