Thread: what is the best way of storing text+image documents in postgresql

what is the best way of storing text+image documents in postgresql

From
Arash pajoohande
Date:

hello
i have a lot of files in microsoft word format. each file consists of text and images (other text formatting like font is not important).
i want to store this documents in Postgresql, and documents must display on web page when corresponding user requests occurs.

it seems theres 2 way to do that:
1. save .doc documents in bytea columns. and show them with a word reader in web page (disadvantage: it needs a proper .doc reader installed on user computer)
2. convert document to html format and store html code as string (disadvantage: images will remain in file system)

what is your offer? any help highly would be appreciated

thanks in advance
arash

Re: what is the best way of storing text+image documents in postgresql

From
Craig Ringer
Date:
On 06/08/2011 06:13 PM, Arash pajoohande wrote:

> 1. save .doc documents in bytea columns. and show them with a word
> reader in web page (disadvantage: it needs a proper .doc reader
> installed on user computer)

1a: Convert the .doc files to a standard format like PDF that most
browsers can display. That's what I'd do.

> 2. convert document to html format and store html code as string
> (disadvantage: images will remain in file system)

You could always store images as 'bytea' fields on a subtable.

CREATE TABLE doc (
   id serial primary key,
   doc_html text,
   ...
);

CREATE TABLE doc_images (
   id serial primary key,
   doc_id integer references doc(id),
   image_data bytea not null
);

There are advantages and disadvantages to storing files in the database
vs in the file system. It's been discussed to death recently, so see the
mailing list archives.

--
Craig Ringer

Re: what is the best way of storing text+image documents in postgresql

From
John R Pierce
Date:
On 06/08/11 6:06 AM, Craig Ringer wrote:
>> 1. save .doc documents in bytea columns. and show them with a word
>> reader in web page (disadvantage: it needs a proper .doc reader
>> installed on user computer)
>
> 1a: Convert the .doc files to a standard format like PDF that most
> browsers can display. That's what I'd do.

thats harder to integrate with a website in the sense that the PDF
documents are hard page formatted, and can at best be displayed in an
<iframe> within your site, and half the time, only displayed in an
external PDF viewer since browser-pdf integration remains flakey and
bugridden after all these years.   PDF text won't flow to fit your page
layout, etc etc.

one approach to conversion might be to save the documents as an RTF type
format, and run that through a preprocessor that reencodes them as a
clean HTML or similar metalanguage that you can deal with intelligently.
    MS Word's own HTML converter creates wretched HTML with tons of
extra bizarro-world tags which likely will trip up your page formatting
if you display these in context in your pages.

as Craig suggested, images can be stored as bytea objects, and the image
links converted to point to a CGI that fetches them from the database
and presents them to the client browser.

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast


Re: what is the best way of storing text+image documents in postgresql

From
Tomas Vondra
Date:
Dne 8.6.2011 21:37, John R Pierce napsal(a):
> On 06/08/11 6:06 AM, Craig Ringer wrote:
>>> 1. save .doc documents in bytea columns. and show them with a word
>>> reader in web page (disadvantage: it needs a proper .doc reader
>>> installed on user computer)
>>
>> 1a: Convert the .doc files to a standard format like PDF that most
>> browsers can display. That's what I'd do.
>
> thats harder to integrate with a website in the sense that the PDF
> documents are hard page formatted, and can at best be displayed in an
> <iframe> within your site, and half the time, only displayed in an
> external PDF viewer since browser-pdf integration remains flakey and
> bugridden after all these years.   PDF text won't flow to fit your page
> layout, etc etc.

OTOH, the probability that the visitor has a PDF reader is much higher.
Plus the PDF is usually much easier to index etc. But yes, using this to
build a website is PITA.

> one approach to conversion might be to save the documents as an RTF type
> format, and run that through a preprocessor that reencodes them as a
> clean HTML or similar metalanguage that you can deal with intelligently.
>    MS Word's own HTML converter creates wretched HTML with tons of extra
> bizarro-world tags which likely will trip up your page formatting if you
> display these in context in your pages.

You could as well run htmltidy or something like that on the HTML. But
in both cases, this will seriously damage the formatting. So if the OP
wants to preserve it, the only viable solution is to keep the .doc
format or convert it to a .pdf and pray there's a working viewer.

Or you can convert the PDF into images ("convert" from imagemagick can
do that quite easily), display those images on the web and offer the PDF
for download.

regards
Tomas

Re: what is the best way of storing text+image documents in postgresql

From
Radosław Smogura
Date:
 On Wed, 8 Jun 2011 14:43:16 +0430, Arash pajoohande wrote:
> hello
> i have a lot of files in microsoft word format. each file consists of
> text and images (other text formatting like font is not important).
> i want to store this documents in Postgresql, and documents must
> display on web page when corresponding user requests occurs.
>
> it seems theres 2 way to do that:
> 1. save .doc documents in bytea columns. and show them with a word
> reader in web page (disadvantage: it needs a proper .doc reader
> installed on user computer)
> 2. convert document to html format and store html code as string
> (disadvantage: images will remain in file system)
>
> what is your offer? any help highly would be appreciated
>
> thanks in advance
> arash

 If you want to do 2. there is no problem your application should
 display documents like this
 /browswer/<document_id>/<start>.html

 Then after conversion put all generated files in database, converted
 document (start.html) should be relatively linked, if some browser will
 call to load image big/my_logo.png, then it will do request
 /browswer/<document_id>/html/big/my_logo.png

 Your CGI/Servlet will capture every request after /browser, you have id
 of document and requested file name passed in URL then do
 SELECT content where doc_id=... and filename='....'

 Caching, speed-uping, on demand generation, securing is on your side.

 Regards,
 Radek

Re: what is the best way of storing text+image documents in postgresql

From
Vincent Veyron
Date:
Le mercredi 08 juin 2011 à 14:43 +0430, Arash pajoohande a écrit :

> and documents must display on web page when corresponding user
> requests occurs.

I wonder why that is? is there any reason you can't just store and serve
the binary content as is, simply letting the proper application (Word in
that case) open it?

--
Vincent Veyron
http://marica.fr/
Logiciel de gestion des sinistres et des contentieux pour le service juridique


Re: what is the best way of storing text+image documents in postgresql

From
Arash pajoohande
Date:
hello all and thank you for your advices
On Thu, Jun 9, 2011 at 7:05 PM, Vincent Veyron <vv.lists@wanadoo.fr> wrote:
Le mercredi 08 juin 2011 à 14:43 +0430, Arash pajoohande a écrit :

> and documents must display on web page when corresponding user
> requests occurs.

I wonder why that is? is there any reason you can't just store and serve
the binary content as is, simply letting the proper application (Word in
that case) open it?
that is because MS Word may hasn't been installed in user machine to handle the page.

Actually, my word files consists of many exam questions. for each user, a random subset of questions must selected and displayed.
it seem's that it would be nice if I have each question as an easy to handle document section (e.g. html div) and display them without need of any other application.
for now, i think the best solution for me is to convert each question as an RTF type format, and run that through a preprocessor that reencodes them as a clean HTML (as john r pierce suggested).

thanks in advance
Arash

Re: what is the best way of storing text+image documents in postgresql

From
John R Pierce
Date:
On 06/10/11 12:04 AM, Arash pajoohande wrote:
> Actually, my word files consists of many exam questions. for each
> user, a random subset of questions must selected and displayed.
> it seem's that it would be nice if I have each question as an easy to
> handle document section (e.g. html div) and display them without need
> of any other application.
> for now, i think the best solution for me is to convert each question
> as an RTF type format, and run that through a preprocessor that
> reencodes them as a clean HTML (as john r pierce suggested).


actually, for THAT requirement, I think I'd store the questions in the
database as plain text and wrap them in HTML formatting in your View or
Presentation layer as its being sent to the user....


--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast


Re: what is the best way of storing text+image documents in postgresql

From
Craig Ringer
Date:
On 06/10/2011 03:26 PM, John R Pierce wrote:

> actually, for THAT requirement, I think I'd store the questions in the
> database as plain text and wrap them in HTML formatting in your View or
> Presentation layer as its being sent to the user....

I agree - sending a PDF, Word doc, RTF, etc for that is just horrid.

Consider saveing the Word docs as plain text, then parsing them with a
Perl script or something to split them into questions and answers you
can store in the database.

Even if the script only does a rough job it might do enough of the job
that you could quickly go through and fix them up / check them against
the original documents.

There are almost certainly existing questionnaire / exam scripts or
applications you can re-use for the presentation side.

--
Craig Ringer

Re: what is the best way of storing text+image documents in postgresql

From
Leif Biberg Kristensen
Date:
On Friday 10. June 2011 09.04.18 Arash pajoohande wrote:

> Actually, my word files consists of many exam questions. for each user, a
> random subset of questions must selected and displayed.
> it seem's that it would be nice if I have each question as an easy to
> handle document section (e.g. html div) and display them without need of
> any other application.

That is a very trivial task in eg. PHP. Like John and Craig has said, just
store the questions as plain text in the database. (Watch out for special MS
characters.) Let the PHP script pick a random subset of predefined size and
display it in the browser. Click on File -> Print.

Sounds like less than a day's work to me.

regards, Leif

Re: what is the best way of storing text+image documents in postgresql

From
Vincent Veyron
Date:
Le vendredi 10 juin 2011 à 11:34 +0430, Arash pajoohande a écrit :

>
>         > and documents must display on web page when corresponding
>         user
>         > requests occurs.
>
>
>         I wonder why that is? is there any reason you can't just store
>         and serve
>         the binary content as is, simply letting the proper
>         application (Word in
>         that case) open it?
> that is because MS Word may hasn't been installed in user machine to
> handle the page.
>

Provide a link to LibreOffice?


> for now, i think the best solution for me is to convert each question
> as an RTF type format, and run that through a preprocessor that
> reencodes them as a clean HTML (as john r pierce suggested).

That depends on how clean your documents are : it could be hard to
convert them. But I agree on the principle.


--
Vincent Veyron
http://marica.fr/
Logiciel de gestion des sinistres et des contentieux pour le service juridique