Thread: html to postgres...

From
Tony Grant
Date:
Hello,

I have several hundred HTML (!) pages that need to be converted to a
format that in turn can be imported into PostgreSQL. They are all built
on a very similar grid.

Any thoughts?

Cheers

Tony Grant

--
RedHat Linux on Sony Vaio C1XD/S
http://www.animaproductions.com/linux2.html
Macromedia UltraDev with PostgreSQL
http://www.animaproductions.com/ultra.html


Re: html to postgres...

From
Tony Grant
Date:
On 16 Jul 2001 11:07:55 -0400, Mitch Vincent wrote:
> You could put the entire HTML page directly into a text type field in
> PG..... That would give you limited flexibility as far as searching and
> indexing goes but you didn't mention any specifics of what you were
> attempting to do by having the pages in a database....

Yes I was vague - the heat is coming back...

These are film and director pages in a movie site. I am looking at
HTML->XML tools then with a parser I should be able to create a tab
delimited text file.

Now that we will be moving from hundreds to thousands of pages, a
database-generated site seems more reasonable...

Cheers

Tony



Re: html to postgres...

From
"Mitch Vincent"
Date:
You could put the entire HTML page directly into a text type field in
PG..... That would give you limited flexibility as far as searching and
indexing goes but you didn't mention any specifics of what you were
attempting to do by having the pages in a database....

Good luck!

-Mitch


Re: html to postgres...

From
Jose Manuel Lorenzo Lopez
Date:
16.07.2001 16:48:53, Tony Grant <tony@animaproductions.com> wrote:

>Hello,
>
>I have several hundred HTML (!) pages that need to be converted to a
>format that in turn can be imported into PostgreSQL. They are all built
>on a very similar grid.
>
>Any thoughts?

Hello Tony,

I guess you want to insert the data from the tables in your HTML
pages, am I right?

If so, maybe you want to create a shell script (Perl???) that reads the
rows of your tables in the HTML pages, using the HTML tags
<th></th> and <td></td> as field separators.

You can also first convert the HTML pages into human-readable ASCII tables :)
and then read this output with a script (again, Perl???). The tool to do
this is 'html2text'.
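
A minimal sketch of the tag-based approach, written in Python rather than Perl and assuming the data sits in ordinary <tr>/<th>/<td> markup. The names TableRowParser and html_table_to_tsv are hypothetical, chosen only for this illustration; the parser is the standard library's html.parser.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _flush_cell(self):
        if self._cell is not None and self._row is not None:
            # Collapse whitespace so a cell never contains a tab or newline.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag == "tr":
            self._flush_row()


def html_table_to_tsv(html):
    """Return the table cells found in `html` as tab-delimited lines."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    parser._flush_row()  # keep a final row that was never closed
    return "\n".join("\t".join(row) for row in parser.rows)
```

Each page would be read from disk and passed to html_table_to_tsv; pages that do not keep their data in table cells would need different handling.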

Best Regards  / Mit freundlichen Gruß / Un saludo

José Manuel Lorenzo López





Re: html to postgres...

From
Vince Vielhaber
Date:
On 16 Jul 2001, Tony Grant wrote:

> Hello,
>
> I have several hundred HTML (!) pages that need to be converted to a
> format that in turn can be imported into PostgreSQL. They are all built
> on a very similar grid.
>
> Any thoughts?

For the integrated docs I wrote a C program (I suck at perl) and a
shell script to send to stdout:

\connect whateverdatabaseyouwant
insert into table(pagereference,textofpage) values('titleofpage','

Then parse the file and send it to stdout, escaping any apostrophes and
backslashes.  Then follow it up with:

');\n

and when run it took a directory of about 480 files and neatly filed
them into a table.
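
The same idea can be sketched in Python; this is an illustration, not the C program described above. The column names pagereference and textofpage come from the message, while the table name "pages", the use of the file name as the page reference, and the function names are assumptions. Because current PostgreSQL releases treat backslashes literally in ordinary string literals by default, the sketch emits E'...' escape-string literals so that the doubled backslashes are interpreted as intended.

```python
import sys
from pathlib import Path


def sql_escape(text):
    """Double backslashes and apostrophes for use inside an E'...' literal."""
    return text.replace("\\", "\\\\").replace("'", "''")


def insert_statement(table, title, body):
    """Build one INSERT; `table` is trusted input, the two values are escaped."""
    return (
        f"INSERT INTO {table} (pagereference, textofpage) "
        f"VALUES (E'{sql_escape(title)}', E'{sql_escape(body)}');"
    )


def dump_directory(directory, table="pages", out=sys.stdout):
    """Write one INSERT per .html file in `directory` to `out`."""
    for path in sorted(Path(directory).glob("*.html")):
        body = path.read_text(encoding="utf-8", errors="replace")
        # Assumption: the file name (without extension) stands in for the title.
        out.write(insert_statement(table, path.stem, body) + "\n")
```

The output of dump_directory can be piped into psql, which plays the role of the \connect step. When the import is done from a program rather than a generated script, parameterized queries in a database driver avoid hand-written escaping altogether.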

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH    email: vev@michvhf.com    http://www.pop4.net
         56K Nationwide Dialup from $16.00/mo at Pop4 Networking
        Online Campground Directory    http://www.camping-usa.com
       Online Giftshop Superstore    http://www.cloudninegifts.com
==========================================================================




Re: html to postgres...

From
"Mitch Vincent"
Date:
> Yes I was vague - the heat is coming back...

Still somewhat vague, though we're getting there!

> These are film and director pages in a movie site. I am looking at
> HTML->XML tools then with a parser I should be able to create a tab
> delimited text file.

    Ok, it seems that you're going to have to write something to do the
inserting into the database, as I assume you're creating a custom schema. A
few thousand web pages with a few fields shouldn't take long to import; I'm
guessing that it won't be all that much data. Something quick in Perl/C or
even PHP would work once you have the individual HTML files parsed into
your delimited file.
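
Once the fields are extracted, one way to prepare the delimited file is to write it in the text format that PostgreSQL's COPY command reads: tab-separated fields, \N for NULL, and backslash escapes for tabs, newlines, and backslashes inside a value. A minimal Python sketch, with hypothetical function names:

```python
def copy_escape(value):
    """Encode one field for PostgreSQL's COPY text format."""
    if value is None:
        return "\\N"  # COPY's default marker for NULL
    return (
        value.replace("\\", "\\\\")  # backslash first, before adding new ones
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def copy_line(fields):
    """Join already-extracted field values into one COPY-ready line."""
    return "\t".join(copy_escape(field) for field in fields)
```

A file of such lines can then be loaded with COPY ... FROM on the server or with \copy in psql; the target table and its column order are whatever the custom schema defines.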

Assuming you want to parse these pages into fields (name, descriptions,
whatever else), that seems to me to be the hardest thing to do, especially
if the pages weren't written with that in mind... What (XML?) tool do you
intend to use to parse out these fields, and how will it know what goes in
what field? Have you written the pages in a way that lets you
programmatically decide everything you need to?

    I probably didn't tell you anything you didn't already know... Sorry if
it wasn't any help :-)

> Now that we will be moving from hundreds to thousands of pages, a
> database-generated site seems more reasonable...

    I see.. Well, it looks like you're probably going to have to write
something to do the parsing for you and after that is done, inserting it
into the database is cake.

    Good luck!

-Mitch




Re: html to postgres...

From
Tony Grant
Date:
On 16 Jul 2001 08:34:46 -0700, Pete Leonard wrote:
>
> The other option, assuming that the pages are consistent, is that you
> roll-your-own perl script to parse the pages & handle the inserts to the
> database - do you have anyone available to you with perl knowledge?

No... That's why I'm looking around for tools.

I have found a couple of tools, but they are non-free. I used to have a
tool for BBEdit on the Mac that did this but I can't find it on my backup
CD-ROMs... It must be on the ones that are at home.

Cheers

Tony



Re: html to postgres...

From
markMLl.pgsql-general@telemetry.co.uk
Date:
Tony Grant wrote:
>
> These are film and director pages in a movie site. I am looking at
> HTML->XML tools then with a parser I should be able to create a tab
> delimited text file.

I'm successfully storing scripts in tables, which are pulled and executed
on a client system. It works well so far, except that if you are using
Win-32 ODBC you must have the latest version; otherwise each script will
be chopped at 8K without warning.

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or
colleagues]

Re: html to postgres...

From
"Richard Huxton"
Date:
From: "Tony Grant" <tony@animaproductions.com>

> On 16 Jul 2001 11:07:55 -0400, Mitch Vincent wrote:
> > You could put the entire HTML page directly into a text type field in
> > PG..... That would give you limited flexibility as far as searching and
> > indexing goes but you didn't mention any specifics of what you were
> > attempting to do by having the pages in a database....
>
> Yes I was vague - the heat is coming back...
>
> These are film and director pages in a movie site. I am looking at
> HTML->XML tools then with a parser I should be able to create a tab
> delimited text file.

Did something similar myself a while ago.

Assuming the pages were all generated from a template originally, I found
the following the simplest.

1. Construct your database structure, create some test data.
2. Create output system (db=>xml=>html whatever).
3. Build a (set of) perl script(s) to parse the HTML and strip the data out
   (if your pages are anything like mine, they're not *identical* formats so
   you'll end up needing something custom-built).
4. Push the data into PostgreSQL.
5. Publish the website from the database.
6. Run a "diff" of the old and new pages.
7. Tweak system as required and repeat until satisfied everything works.

The key problem I found was that unless the pages were generated from a
database to start with, they all seemed to have minor changes. The only way
I could be satisfied I'd not missed data was to publish and compare. The
first couple of "diff"s were scary, and I ended up cutting and pasting a few
pieces manually, but I got everything out.
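
The publish-and-compare step can be sketched with Python's standard difflib module. This is an assumption-laden illustration: the function names and the old/ and new/ labels are hypothetical, and the whitespace normalization is a guess at what keeps layout-only differences out of the report.

```python
import difflib


def normalize(html):
    """Strip each line and drop blank ones so layout noise is ignored."""
    return [line.strip() for line in html.splitlines() if line.strip()]


def page_diff(old_html, new_html, name="page.html"):
    """Return unified-diff lines between the original and regenerated page."""
    return list(
        difflib.unified_diff(
            normalize(old_html),
            normalize(new_html),
            fromfile=f"old/{name}",
            tofile=f"new/{name}",
            lineterm="",
        )
    )
```

An empty result means the regenerated page matches the original after normalization; anything else points at data the parsing scripts missed or at a page that drifted from the template.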

HTH

- Richard Huxton


Re: html to postgres...

From
Tony Grant
Date:
Thanks to all for the input.

Because of the limits of my knowledge of Perl, C, etc., I'll do a batch
job from BBEdit. I have found a Tidy plugin for BBEdit, installed the
latest version of BBEdit Lite and the HTML tools. All I need now is to
write up the description for the XML output.

While it may sound strange that I'll do this from a Mac, it is the tool I
am most familiar with for this type of task. From 1994 till 1998 I used
BBEdit as a content creation, FTP and data preformatting tool for SQL.

Cheers

Tony



Re: html to postgres...

From
V Alex Brennen
Date:
On 16 Jul 2001, Tony Grant wrote:

> Hello,
>
> I have several hundred HTML (!) pages that need to be converted to a
> format that in turn can be imported into PostgreSQL. They are all built
> on a very similar grid.
>
> Any thoughts?

I've written code to store ASCII text in pg Oids.  I also have
code that pulls out the ASCII text and renders it as part of a
web page. It is used to store radix-encoded OpenPGP public
keys for a public key server.  The code is all written in C.


You can get the code from:
http://www.cryptnet.net/fsp/cks/

And you can see the keyserver running (with about 120,000 keys
in the db) at:
http://keyserver.cryptnet.net/


    - VAB
---
V. Alex Brennen      <vab@cryptnet.net>
[ http://www.cryptnet.net/people/vab/ ]
[ http://www.advogato.org/person/vab/ ]