Thread: web archiving

web archiving

From: Matt Price
Date: Wed, 10 Jul 2002
Hi there,

I've just moved up from non-free OSes to Debian Linux, and installed
PostgreSQL, with the hope of getting started on some projects I've
been thinking about.  Several of these projects involve web archives.
The idea is that a URL is entered along with a bunch of
bibliographic-type data in other fields (keywords, author, date,
etc.).  The HTML (and, hopefully, the accompanying images, CSS, etc.)
is then grabbed using curl and archived in a PostgreSQL database.  A
web or other GUI interface then provides fully searchable access to
the archive for later use.
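
Roughly, I'm picturing something like this (just a sketch; the table
and column names are made up, and I'm sure a real schema would need
more thought):

    CREATE TABLE pages (
        id        serial PRIMARY KEY,
        url       text NOT NULL,
        author    text,
        keywords  text,   -- comma-separated to start with
        fetched   date,
        html      text    -- the page source itself
    );

A PHP page would then do the curl fetch and the INSERT, and searching
would just be SELECTs against these columns.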

So my question: does anyone know of a similar tool that already
exists?  I'm a complete novice at database programming (and at PHP,
too, which is what I figured I'd use as the scripting language,
though I'd consider learning Perl or Java if folks think that's a
much better idea), and I'd rather work with some pre-existing code
than start from the ground up.  Any suggestions?  Is this the right
list to be asking this question on?

Thanks loads,
Matt


Re: web archiving

From: Philip Hallstrom
Date: Wed, 10 Jul 2002 18:21
Not to discourage you from using PostgreSQL or writing it yourself,
but you might want to take a look at wget (for downloading the web
pages) and mnogosearch or htdig for searching them.

mnogosearch supports PostgreSQL and has a PHP interface so you can
have fun with that...
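
For example (untested, and worth checking against the wget man page),
something like

    wget -r -l 1 -p -k http://www.example.com/some/page.html

will grab a page along with the images and stylesheets it needs, and
rewrite the links so the local copy is browsable.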

On 10 Jul 2002, Matt Price wrote:

> So my question: does anyone know of a similar tool that already
> exists?  I'm a complete novice at database programming (and at PHP,
> too), and I'd rather work with some pre-existing code than start
> from the ground up.


Re: web archiving

From: Matt Price
Date: Thu, 11 Jul 2002 12:52
Hi Philip, et al.,

Well, wget is nice, and htdig/mnogosearch both seem great; but I want
to be able to enter extra data about the web pages (author names,
comments, subject/keyword entries...) so that the database starts to
resemble a bibliographic database.  That is, I want other people to
be able to take advantage of the work that I and other data-entry
slaves do when we enter the URLs.
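
Concretely, with a table like the one I sketched earlier, I'm
imagining searches along these lines (crude, I know, and a real
full-text indexer would handle the content search much better):

    SELECT url, author
      FROM pages
     WHERE keywords ILIKE '%labour%'   -- metadata filter
       AND html ILIKE '%strike%';      -- content search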

Does that seem silly?

matt

On Wed, 2002-07-10 at 18:21, Philip Hallstrom wrote:
> Not to discourage you from using PostgreSQL or writing it yourself,
> but you might want to take a look at wget (for downloading the web
> pages) and mnogosearch or htdig for searching them.
>
> mnogosearch supports PostgreSQL and has a PHP interface so you can
> have fun with that...


Re: web archiving

From: Ron Johnson
Date:
On Thu, 2002-07-11 at 12:52, Matt Price wrote:
> Hi Philip, et al.,
>
> Well, wget is nice, and htdig/mnogosearch both seem great; but I
> want to be able to enter extra data about the web pages (author
> names, comments, subject/keyword entries...) so that the database
> starts to resemble a bibliographic database.  That is, I want other
> people to be able to take advantage of the work that I and other
> data-entry slaves do when we enter the URLs.
>
> Does that seem silly?

What you could do is use wget to store the HTML "trees" in individual
directories, and then store the web "metadata" plus the location in
the database.  That would minimize the size of the database, and it
keeps the HTML in its "natural habitat", the file system, where it's
available to Apache, etc.
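
A rough sketch of what I mean (table and column names are just made
up, and the wget flags are worth double-checking):

    -- fetch each document into its own directory, e.g.:
    --   wget -r -l 1 -p -k -P /var/archive/0001 http://www.example.com/page.html
    CREATE TABLE documents (
        id        serial PRIMARY KEY,
        url       text NOT NULL,
        location  text NOT NULL,  -- e.g. '/var/archive/0001'
        author    text,
        keywords  text,
        comments  text,
        fetched   timestamp DEFAULT now()
    );

Apache can then serve everything under /var/archive directly, and the
database rows stay small.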


--
+-----------------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net             |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81         |
|                                                                 |
| "Experience should teach us to be most on our guard to protect  |
|  liberty when the government's purposes are beneficent. Men     |
|  born to freedom are naturally alert to repel invasion of their |
|  liberty by evil minded rulers. The greatest dangers to liberty |
|  lurk in insidious encroachment by men of zeal, well-meaning    |
|  but without understanding."                                    |
|   Justice Louis Brandeis, dissenting, Olmstead v US (1928)      |
+-----------------------------------------------------------------+