Thread: electronic-izing unicode texts

electronic-izing unicode texts

From
"A. Cropi"
Date:
hi everyone,

i have several hundred books that were typed in unicode and would
like to put them into a database so that i can run searches on
them.  how does one design a database for this?

i was planning to make a table with these columns: ID, Title, Authors,
Publishers, Content

the Content column will contain the entire book in unicode; then, to
find out which books contain the string "blah" i'd just do something
like select * from table where content contains "blah"
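
A minimal sketch of the table and the "contains" search described above. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server, and the table name `books`, the helper `find_books`, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE books (
        id         INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        authors    TEXT,
        publishers TEXT,
        content    TEXT  -- the whole book as Unicode text
    )
    """
)

# Hypothetical sample rows; real books would be read from files.
conn.executemany(
    "INSERT INTO books (title, authors, publishers, content) VALUES (?, ?, ?, ?)",
    [
        ("Book One", "Author A", "Publisher X", "some text with blah inside"),
        ("Book Two", "Author B", "Publisher Y", "nothing of interest here"),
        ("Book Three", "Author C", "Publisher Z", "ünïcödé text, blah again"),
    ],
)


def find_books(conn, needle):
    """Return (id, title) for every book whose content contains needle."""
    # instr() is a plain substring test, so % and _ in needle need no escaping.
    # The needle is passed as a bound parameter, never pasted into the SQL.
    rows = conn.execute(
        "SELECT id, title FROM books WHERE instr(content, ?) > 0 ORDER BY id",
        (needle,),
    )
    return rows.fetchall()


matches = find_books(conn, "blah")
print(matches)
```

In PostgreSQL the same test could be written as `content LIKE '%blah%'` or `position('blah' in content) > 0`. Either way, a plain substring test like this has to read the full text of every book on each query, which is worth keeping in mind for books of close to a million words.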

my problem is: (1) i have never done database work before (2) i do not
have any experience in anything like this

my objectives: (1) allow users to run queries through the web (i guess
i will do this via PHP interacting with postgresql)

my questions are: (1) is it reasonable to put the book content into the
CONTENT column? (2) the content of the book can be very long (some of
them have nearly 1 million words), so, what kind of considerations
should i be making? (3) how should i design something like this? there
must be someone out there that has done something similar to this.. if
so, please share your experiences.

note: these texts are not copyrighted.. so i do not have to worry
about the legal problems.

tia

Re: electronic-izing unicode texts

From
Richard Huxton
Date:
A. Cropi wrote:
> my objectives: (1) allow users to run queries through the web (i guess
> i will do this via PHP interacting with postgresql)
>
> my questions are: (1) is it reasonable to put the book content into the
> CONTENT column? (2) the content of the book can be very long (some of
> them have nearly 1 million words), so, what kind of considerations
> should i be making? (3) how should i design something like this? there
> must be someone out there that has done something similar to this.. if
> so, please share your experiences.

You might be better off with a web-indexing package.
   http://freshmeat.net/search/?q=web+indexing&section=projects

Since you're not structuring the content of the book, most of the
advantages of an RDBMS don't apply. If you're going to treat it as text,
just use one of the text indexing systems above.
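
To illustrate what a text indexing system does that a plain "contains" scan does not, here is a toy inverted index in Python. It is a sketch of the general idea only, not how htdig or swish are implemented; the function names and sample documents are hypothetical, and splitting on `\w+` assumes the text separates words with spaces or punctuation.

```python
import re
from collections import defaultdict


def tokenize(text):
    # \w is Unicode-aware for str patterns in Python 3;
    # casefold() makes matching case-insensitive.
    return re.findall(r"\w+", text.casefold())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents standing in for whole books.
docs = {
    "book1": "The quick brown fox",
    "book2": "A quick Überraschung",
    "book3": "brown bread",
}
index = build_index(docs)
print(sorted(search(index, "quick")))
print(sorted(search(index, "quick brown")))
```

The index is built once, and each search is then a handful of dictionary lookups and set intersections rather than a pass over every book.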

I would convert each book into one or more web-pages (perhaps one page
per section/chapter) and then use htdig or swish.
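
A rough sketch of the "one page per chapter" conversion, assuming each chapter starts with a heading line such as "Chapter 1". That heading pattern, the function names, and the output file naming are all assumptions made for illustration; the real books may mark sections differently. The resulting pages would then be handed to the indexer.

```python
import html
import re
from pathlib import Path

# Assumed heading convention: a line like "Chapter 1" or "CHAPTER 12: Title".
CHAPTER_RE = re.compile(r"^chapter\s+\d+.*$", re.IGNORECASE | re.MULTILINE)


def split_chapters(text):
    """Return a list of (heading, body) pairs for one book."""
    matches = list(CHAPTER_RE.finditer(text))
    if not matches:
        return [("Full text", text.strip())]
    chapters = []
    # Keep anything before the first heading as its own page.
    front = text[: matches[0].start()].strip()
    if front:
        chapters.append(("Front matter", front))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((m.group(0).strip(), text[m.end():end].strip()))
    return chapters


def write_pages(book_slug, text, out_dir):
    """Write one UTF-8 HTML page per chapter and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, (heading, body) in enumerate(split_chapters(text), start=1):
        page = (
            "<!DOCTYPE html>\n"
            '<html><head><meta charset="utf-8">'
            f"<title>{html.escape(heading)}</title></head>\n"
            f"<body><h1>{html.escape(heading)}</h1>\n"
            f"<pre>{html.escape(body)}</pre></body></html>\n"
        )
        path = out / f"{book_slug}-{n:03d}.html"
        path.write_text(page, encoding="utf-8")
        paths.append(path)
    return paths


# Hypothetical sample text standing in for a real book.
sample = "Preface text\nChapter 1\nFirst body\nChapter 2\nSecond body\n"
chapters = split_chapters(sample)
print(chapters)
```

Declaring the charset as utf-8 in each page and writing the file with the same encoding keeps the Unicode text intact for whatever tool reads the pages next.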

--
   Richard Huxton
   Archonet Ltd