Thread: web archiving
Hi there,

I've just moved from non-free OSes to Debian Linux and installed PostgreSQL, in the hope of getting started on some projects I've been thinking about. Several of these projects involve web archives. The idea is: a URL is entered along with a bunch of bibliographic-type data in other fields (keywords, author, date, etc.). The HTML (and, hopefully, any accompanying images, CSS, etc.) is then grabbed using curl and archived in a PostgreSQL database. A web or other GUI interface then provides fully searchable access to the archive for later use.

So my question: does anyone know of a similar tool which already exists? I'm a complete novice at database programming (and at PHP, too, which is what I figured I'd use as the scripting language, though I'd consider learning Perl or Java if folks think that's a much better idea), and I'd rather work with some pre-existing code than start from the ground up. Any suggestions? Is this the right list to be asking this question on?

Thanks loads,
Matt
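To make the proposed workflow concrete: a minimal sketch in PHP (the language Matt plans to use) of the "URL plus bibliographic fields in, HTML archived in PostgreSQL" idea. The table name, column names, connection string, and sample values are all hypothetical, and error handling is kept to a bare minimum:

<?php
// Hypothetical connection string; adjust dbname/user to your setup.
$db = pg_connect('dbname=webarchive user=matt') or die('could not connect');

// One row per archived page: bibliographic fields plus the raw HTML.
// Run this once (or skip it if the table already exists).
pg_query($db, "
    CREATE TABLE pages (
        id         serial PRIMARY KEY,
        url        text NOT NULL,
        author     text,
        keywords   text,
        fetched_at timestamptz DEFAULT now(),
        html       text
    )");

// Grab the page with curl, exactly as proposed, and store it
// alongside the bibliographic data.
$url  = 'http://www.example.com/';
$html = shell_exec('curl -s ' . escapeshellarg($url));

pg_query_params($db,
    'INSERT INTO pages (url, author, keywords, html) VALUES ($1, $2, $3, $4)',
    array($url, 'A. Author', 'example, archiving', $html));
?>

Note that pg_query_params (used here for safe parameter binding) postdates this thread; on an older PHP, pg_escape_string plus string concatenation would do the same job.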
Not to discourage you from using PostgreSQL or writing it yourself, but you might want to take a look at wget (for downloading the web pages) and mnoGoSearch or htdig for searching them.

mnoGoSearch supports PostgreSQL and has a PHP interface, so you can have fun with that...

On 10 Jul 2002, Matt Price wrote:
> [original question quoted in full; snipped]
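If you take Philip's wget suggestion, a page and its requisites (images, stylesheets) can be mirrored into a directory of your choosing and then indexed. A sketch, again in PHP; the destination layout (one directory per URL, keyed on an md5 of the URL) is an assumption, but the wget flags are standard: -p fetches page requisites, -k converts links for local browsing, -P sets the directory to download into.

<?php
// Mirror one page plus its images/CSS into its own directory.
$url  = 'http://www.example.com/article.html';
$dest = '/var/archive/' . md5($url);   // hypothetical: one directory per URL

mkdir($dest, 0755, true);
system('wget -p -k -P ' . escapeshellarg($dest) . ' ' . escapeshellarg($url));
?>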
Hi Philip, et al.,

Well, wget is nice, and htdig/mnoGoSearch both seem great; but I want to be able to enter extra data about the web pages (author names, comments, subject/keyword entries...) so that the database starts to resemble a bibliographic database. That is, I want other people to be able to take advantage of the work that I and other data-entry slaves do when we enter the URLs.

Does that seem silly?

Matt

On Wed, 2002-07-10 at 18:21, Philip Hallstrom wrote:
> Not to discourage you from using PostgreSQL or writing it yourself, but
> you might want to take a look at wget (for downloading the web pages) and
> mnoGoSearch or htdig for searching them.
> [snip]
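The "bibliographic database" part Matt wants is mostly a query over the extra fields. Continuing with the hypothetical pages table sketched above, a search page might boil down to something like this (ILIKE is PostgreSQL's case-insensitive LIKE):

<?php
// Search the bibliographic fields; matches author, keywords, or URL.
$db   = pg_connect('dbname=webarchive user=matt') or die('could not connect');
$term = '%' . $_GET['q'] . '%';    // e.g. search.php?q=archiving

$res = pg_query_params($db,
    'SELECT url, author, keywords, fetched_at
       FROM pages
      WHERE author ILIKE $1 OR keywords ILIKE $1 OR url ILIKE $1
      ORDER BY fetched_at DESC',
    array($term));

while ($row = pg_fetch_assoc($res)) {
    printf("%s (%s) -- keywords: %s\n",
           $row['url'], $row['author'], $row['keywords']);
}
?>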
On Thu, 2002-07-11 at 12:52, Matt Price wrote:
> Well, wget is nice, and htdig/mnoGoSearch both seem great; but I want to
> be able to enter extra data about the web pages (author names, comments,
> subject/keyword entries...) so that the database starts to resemble a
> bibliographic database. That is, I want other people to be able to take
> advantage of the work that I and other data-entry slaves do when we enter
> the URLs.
>
> Does that seem silly?

What you could do is use wget to store the HTML "trees" in individual directories. Then you store the web "metadata" plus location in the database. That would minimize the size of the db, plus store the HTML in its "natural habitat": the file system, where it's available to Apache, etc.

-- 
Ron Johnson, Jr.    Home: ron.l.johnson@cox.net
Jefferson, LA USA   http://ronandheather.dhs.org:81

"Experience should teach us to be most on our guard to protect
liberty when the government's purposes are beneficent. Men
born to freedom are naturally alert to repel invasion of their
liberty by evil minded rulers. The greatest dangers to liberty
lurk in insidious encroachment by men of zeal, well-meaning
but without understanding."
  -- Justice Louis Brandeis, dissenting, Olmstead v US (1928)
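Ron's split, HTML trees on disk and metadata plus location in the table, combines the two sketches above: wget writes the files where Apache can serve them, and the database row points at the tree instead of holding the HTML. The local_path column is a hypothetical replacement for the html column used earlier.

<?php
// Fetch to disk, then record the metadata and the tree's location.
$db   = pg_connect('dbname=webarchive user=matt') or die('could not connect');
$url  = 'http://www.example.com/article.html';
$dest = '/var/www/archive/' . md5($url);  // hypothetical: served directly by Apache

mkdir($dest, 0755, true);
system('wget -p -k -P ' . escapeshellarg($dest) . ' ' . escapeshellarg($url));

// Metadata row points at the on-disk tree instead of holding the HTML.
pg_query_params($db,
    'INSERT INTO pages (url, author, keywords, local_path)
     VALUES ($1, $2, $3, $4)',
    array($url, 'A. Author', 'example, archiving', $dest));
?>

This keeps the database small and lets htdig or mnoGoSearch index the on-disk trees while the table carries the bibliographic fields.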