On Tue, 19 Oct 1999, Mike Field wrote:
>
> Hi all-
>
> I have 400 documents in Word for Mac format that I need to enter into my
> database, saving as much formatting as possible (italics, line breaks,
> superscripts for footnote markers, etc...). What is the best way to convert
> this data, and how can I do it automatically?
Perl.
like you said, convert the MS-Word docs into HTML, read each file into a
perl string, do your s///g magic ( dropping <HR> and such ), then use DBI
to insert them into a TEXT field.
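a rough sketch of that s///g step, assuming the exported HTML uses <HR> for the decorative lines and <FONT>/<SPAN> wrappers for styling ( that's a guess about what Word emits, so look at a real file first ). clean_html is a made-up name, not part of any module:

```perl
use strict;
use warnings;

# Strip purely decorative markup from Word-exported HTML while keeping
# the tags that carry meaning (<I>, <SUP>, <BR>, <P>).
# Assumption: decoration shows up as <HR> and <FONT>/<SPAN> wrappers.
sub clean_html {
    my ($html) = @_;
    $html =~ s/<hr\b[^>]*>//gi;                 # horizontal rules
    $html =~ s/<\/?(?:font|span)\b[^>]*>//gi;   # drop the wrapper tags, keep their text
    $html =~ s/[ \t]+\n/\n/g;                   # trailing whitespace left behind
    return $html;
}

# Example: the <HR> and <FONT> go away, <SUP> and <I> stay.
print clean_html(
    qq{<P>Dose<SUP>1</SUP></P>\n<HR WIDTH="80%">\n<FONT SIZE=2><I>Aloe vera</I></FONT>\n}
);
```

regexes like these are fine for 400 files that all came out of the same exporter; they are not a general HTML parser.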
or, if you don't want to force the user to save their files as HTML, use
'mswordview' ( a freely available MS-Word to HTML converter ). that'd
require a bit more logic on your end, but i'm sure your (client|users)
would love you for it. see notes below.
> I'm running Postgres on a Linux server, using PHP.
i'm sorry.
> Parts of each document go into different fields. For example, a document
> has information about a medicinal plant. The entire Word doc has it's
> scientific name, popular names, usages, dosages, toxicities, bibliography,
> etc., each corresponds to a different field in the database.
again, Perl ( aka text-hacker-upper-language ).
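one way the field-splitting could look, assuming each section of a document starts on its own line with a label such as "Usages:". i haven't seen the docs, so the label text, the column names and split_fields itself are all placeholders:

```perl
use strict;
use warnings;

# Hypothetical section labels -> column names; adjust to what the docs really use.
my %column_for = (
    'scientific name' => 'scientific_name',
    'popular names'   => 'popular_names',
    'usages'          => 'usages',
    'dosages'         => 'dosages',
    'toxicities'      => 'toxicities',
    'bibliography'    => 'bibliography',
);

# Split cleaned text into { column => text }. A line of the form "Label: ..."
# with a known label starts a new field; any other line is appended to the
# field currently being collected.
sub split_fields {
    my ($text) = @_;
    my (%field, $current);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*([A-Za-z ]+?)\s*:\s*(.*)$/ && exists $column_for{lc $1}) {
            $current = $column_for{lc $1};
            $field{$current} = $2;
        }
        elsif (defined $current) {
            $field{$current} .= "\n$line";
        }
    }
    return \%field;
}

my $doc = split_fields("Scientific name: <I>Aloe vera</I>\nUsages: burns\nand minor cuts\n");
print "$_ => $doc->{$_}\n" for sort keys %$doc;
```

the hash keys then map straight onto the columns of the INSERT statement.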
> I figure converting into html ultimately will keep the formatting I need,
> though there's a LOT of garbage in the Word files I DON'T want (horizontal
> lines to make it look nice, etc.).
can you guess what i'm going to say here ? :)
seriously, though... perl is geared towards text manipulation, so you might
as well make use of it. if you have all these MS-Word docs ( saved as HTML )
sitting in a directory, just do an opendir() and readdir(), cycling through
all the files, doing your s///g magic, and inserting into pgsql.
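the loop could be sketched like this. it needs DBI and DBD::Pg installed to actually talk to the database; the database name "herbs", the table "plants" and its columns are invented for the example, as are the helper names:

```perl
use strict;
use warnings;

# Return the full path of every .htm/.html file in $dir, sorted.
sub html_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "can't opendir $dir: $!";
    my @files = sort grep { -f $_ } map { "$dir/$_" } grep { /\.html?$/i } readdir($dh);
    closedir($dh);
    return @files;
}

# Read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "can't open $path: $!";
    local $/;                       # undef the record separator: read it all at once
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Insert each file into a hypothetical "plants" table through a DBI handle.
sub load_dir {
    my ($dbh, $dir) = @_;
    my $sth = $dbh->prepare('INSERT INTO plants (filename, body) VALUES (?, ?)');
    for my $path (html_files($dir)) {
        my $html = slurp($path);
        # ... your s///g cleanup goes here ...
        $sth->execute($path, $html);
    }
}

# Only touch the database when a directory is given on the command line.
if (@ARGV) {
    require DBI;
    my $dbh = DBI->connect('dbi:Pg:dbname=herbs', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    load_dir($dbh, $ARGV[0]);
    $dbh->commit;                   # all 400 go in, or none do
    $dbh->disconnect;
}
```

the ? placeholders let DBI do the quoting, which matters when the text is full of apostrophes and angle brackets.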
if you want to allow the user to upload a MS-Word doc, the same logic
applies: get the file, read it, do your s///g stuff, insert it. there's a
perl module that handles web-based file uploads -- CGI.pm or cgi-lib.pl
should cover it. the billing system i wrote for a local ISP makes use of
web-based file uploading to handle credit card billing via ICVerify; the
cgi sends a file to the user, user runs icverify with said file, user
uploads output of icverify via a ~250 line perl script, perl script parses
the file and makes the appropriate entries to the db.
but anyway... if you want to allow your users to upload the MS-Word doc
itself, use mswordview. what you'd want to do here is grab the data from
the form and write to a local file. run mswordview on that file,
opendir() the directory that mswordview created, parse the files and
insert the appropriate ones into the db. or you could 'borrow' parts of
mswordview, assuming it's modular, and do the ms-word -> html conversion
on the fly ((c)1999 Steve Jobs).
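the upload half of that might look roughly like this with CGI.pm. the form field name 'doc' is invented, save_upload and convert are made-up helpers, and running the converter as plain "mswordview FILE" is an assumption, so check its docs for the real options and for where the output ends up:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Copy an uploaded filehandle into a fresh temp directory; return the new path.
sub save_upload {
    my ($in, $name) = @_;
    $name =~ s/[^A-Za-z0-9._-]/_/g;     # never trust a client-supplied filename
    my $dir  = tempdir();               # caller removes this when finished
    my $path = "$dir/$name";
    open(my $out, '>', $path) or die "can't write $path: $!";
    binmode($_) for $in, $out;          # .doc files are binary
    my $buf;
    print {$out} $buf while read($in, $buf, 8192);
    close($out) or die "can't close $path: $!";
    return $path;
}

# Run the converter on the saved file (invocation is an assumption, see above).
sub convert {
    my ($path) = @_;
    system('mswordview', $path) == 0
        or die "mswordview failed on $path: $?";
}

# CGI entry point: only runs when a web server is calling the script.
if ($ENV{GATEWAY_INTERFACE}) {
    require CGI;
    my $q    = CGI->new;
    my $fh   = $q->upload('doc') or die "no file uploaded";
    my $path = save_upload($fh, scalar $q->param('doc'));
    convert($path);
    print $q->header('text/plain'), "converted $path\n";
}
```

after convert() returns, the opendir()/readdir() pass over the converter's output is the same as for the batch case.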
> Any suggestions from the experts would be great!
>
> Thanks,
> Mike
>
> mike@fieldco.com
> I am Canadian :-)
i'm sorry. :P
---
Howie <caffeine@toodarkpark.org> URL: http://www.toodarkpark.org
"Just think how much deeper the ocean would be if sponges didn't live there."