Thread: postgreSQL for storing a database of documents

postgreSQL for storing a database of documents

From
raphael@cs.uky.edu (Raphael Finkel)
Date:
Is PostgreSQL appropriate for a database of documents?  I see several
potential problems.

1.  Tuples are limited to 8KB.  I assume even the "text" type is limited that
way.  But my documents may be longer than that.

2.  I need to be able to search by content.  I can use a regular-expression
search in a SELECT ... WHERE clause, but I expect that to be exceedingly
inefficient.  I would prefer a GLIMPSE-like search ability.

3.  I will also be conducting searches by some other fields (like author
or call number).  But only one field can be a primary key.  Will selection by
other fields be terribly inefficient for large numbers (about 200K) of large
tuples (ranging from about 100B to say 10KB)?

If you can help, please respond not only to this mailing list but also
directly to me.  Thanks.

Raphael Finkel <raphael@cs.uky.edu>

Re: [GENERAL] postgreSQL for storing a database of documents

From
Dustin Sallings
Date:
On Tue, 4 May 1999, Raphael Finkel wrote:

# Is PostgreSQL appropriate for a database of documents?  I see several
# potential problems.
#
# 1.  Tuples are limited to 8KB.  I assume even the "text" type is limited
# that way.  But my documents may be longer than that.

    This isn't a problem, I store all of my photographs in postgres.

# 2.  I need to be able to search by content.  I can use a
# regular-expression search in a SELECT ... WHERE clause, but I expect
# that to be exceedingly inefficient.  I would prefer a GLIMPSE-like
# search ability.

    This is a problem, unless you use one of the keyword index things
people have been talking about.  I prefer a glimpse or such in this
situation.

# 3.  I will also be conducting searches by some other fields (like author
# or call number).  But only one field can be a primary key.  Will
# selection by other fields be terribly inefficient for large numbers
# (about 200K) of large tuples (ranging from about 100B to say 10KB)?

    Primary key shouldn't be any of these fields, but you can index on
whatever you want.  This is where the RDBMS is useful, as you can
instantly get all documents by a given author and/or on a given date range
and/or a given topic.

--
SA, beyond.com           My girlfriend asked me which one I like better.
pub  1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
|    Key fingerprint =  87 02 57 08 02 D0 DA D6  C8 0F 3E 65 51 98 D8 BE
L_______________________ I hope the answer won't upset her. ____________


Re: [GENERAL] postgreSQL for storing a database of documents

From
Ari Halberstadt
Date:
Raphael Finkel <raphael@cs.uky.edu> wrote:
>Is PostgreSQL appropriate for a database of documents?  I see several
>potential problems.
>
>1.  Tuples are limited to 8KB.  I assume even the "text" type is limited that
>way.  But my documents may be longer than that.

This would pretty much nix the idea of using postgres for a document
repository. You'd need to split the data into separate rows in another
table. I've done this for my bulletin board (though most posts should be
less than 8K in my case), but I think it would be even more of a hassle
with documents. What you can do instead is store descriptive information in
the DB and store the actual documents in the file system with a field in
the document table containing the path to the file. A few years ago I
worked on a commercial document management program that stored attributes
in an Oracle database but kept the actual documents on disk.
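
The layout described above (attributes in the database, document bodies in
the file system, a path column tying them together) might look roughly like
the sketch below. It is only an illustration under stated assumptions: the
table name, column names, and helper functions are hypothetical, and SQLite
from the Python standard library stands in for PostgreSQL so that the
example runs without a server.

```python
import sqlite3
import tempfile
from pathlib import Path


def store_document(conn, doc_dir, author, title, body):
    """Save the body as a file and record its path next to the metadata."""
    cur = conn.execute(
        "INSERT INTO documents (author, title, path) VALUES (?, ?, '')",
        (author, title),
    )
    doc_id = cur.lastrowid
    # Hypothetical naming scheme: one file per document, named by row id.
    path = Path(doc_dir) / f"{doc_id}.txt"
    path.write_text(body, encoding="utf-8")
    conn.execute("UPDATE documents SET path = ? WHERE id = ?", (str(path), doc_id))
    conn.commit()
    return doc_id


def load_document(conn, doc_id):
    """Look up the path in the database, then read the body from disk."""
    row = conn.execute(
        "SELECT path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return Path(row[0]).read_text(encoding="utf-8")


# SQLite is an assumption made for runnability, not what the thread uses.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, author TEXT, title TEXT, path TEXT NOT NULL)"
)
doc_dir = tempfile.mkdtemp()

# A body well over 8KB never touches the row itself, only the file system.
doc_id = store_document(conn, doc_dir, "Finkel", "Example", "x" * 20000)
```

Because the row holds only the path, the document size is bounded by the
file system rather than by the tuple limit, at the cost of keeping the files
and the table in sync yourself.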

>2.  I need to be able to search by content.  I can use a regular-expression
>search in a SELECT ... WHERE clause, but I expect that to be exceedingly
>inefficient.  I would prefer a GLIMPSE-like search ability.

If it's stored outside the database then you can use any kind of text
searching tool. There's a full text add-on for postgres in
contrib/fulltextindex which I've been meaning to install. You could use
this to index all of the attributes so people can do full text searches on
things like comments and descriptions. When presenting the search results
you could merge the results of searching the files with the results of
searching in the text fields.

>3.  I will also be conducting searches by some other fields (like author
>or call number).  But only one field can be a primary key.  Will selection by
>other fields be terribly inefficient for large numbers (about 200K) of large
>tuples (ranging from about 100B to say 10KB)?

You can create as many indexes as you need using create index.
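
As a rough sketch of what that looks like in practice, the example below
creates one index per searchable field and then selects by author and by
call number. The table, column, and index names are made up for
illustration, and SQLite (Python standard library) is used as a stand-in so
the example is runnable; the CREATE INDEX statements have the same basic
form in PostgreSQL.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, author TEXT, call_number TEXT, title TEXT)"
)

# One index per field you expect to search on; neither is the primary key.
conn.execute("CREATE INDEX documents_author_idx ON documents (author)")
conn.execute("CREATE INDEX documents_call_number_idx ON documents (call_number)")

conn.executemany(
    "INSERT INTO documents (author, call_number, title) VALUES (?, ?, ?)",
    [
        ("Finkel", "QA76.9", "Document Stores"),
        ("Finkel", "QA76.7", "Advanced Search"),
        ("Sallings", "TR1", "Photographs"),
    ],
)


def by_author(conn, author):
    """Return titles for one author, sorted by title."""
    rows = conn.execute(
        "SELECT title FROM documents WHERE author = ? ORDER BY title", (author,)
    )
    return [row[0] for row in rows]


def by_call_number(conn, call_number):
    """Return titles for one call number."""
    rows = conn.execute(
        "SELECT title FROM documents WHERE call_number = ?", (call_number,)
    )
    return [row[0] for row in rows]
```

With these definitions, by_author(conn, "Finkel") returns both Finkel
titles without the author column having to be the primary key.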

-- Ari Halberstadt mailto:ari@shore.net <http://www.magiccookie.com/>
PGP public key available at <http://www.magiccookie.com/pgpkey.txt>



Re: [GENERAL] postgreSQL for storing a database of documents

From
Mark Fleming
Date:

Ari Halberstadt wrote:

> Raphael Finkel <raphael@cs.uky.edu> wrote:
> >Is PostgreSQL appropriate for a database of documents?  I see several
> >potential problems.
> >
> >1.  Tuples are limited to 8KB.  I assume even the "text" type is limited that
> >way.  But my documents may be longer than that.
>
> This would pretty much nix the idea of using postgres for a document
> repository. You'd need to split the data into separate rows in another
> table. I've done this for my bulletin board (though most posts should be
> less than 8K in my case), but I think it would be even more of a hassle
> with documents. What you can do instead is store descriptive information in
> the DB and store the actual documents in the file system with a field in
> the document table containing the path to the file. A few years ago I
> worked on a commercial document management program that stored attributes
> in an Oracle database but kept the actual documents on disk.
>

That's a good solution, but now that XML is starting to take off it kind of puts
PostgreSQL users at a disadvantage.  We have to keep static XML documents on
disk in order to use the format, and use external search engines to search for
content in the documents.  It is not that big of a headache, but it is an
inconvenience.  I will be doing some work on XML soon for my job, and they are
going straight to Oracle because they want to create "searchable" XML documents
on the fly to post and distribute on the web.  Very LARGE documents.  Like the
legal code for NYC and such.  When Postgres starts using large objects it is
going to make a lot of noise.  Especially for "start-up" companies.

--
Mark Fleming Student ICQ# 16171200
e-mail: mtony@iname.com
http://www.globaldrum.com/mfleming
http://coolbox.bowiestate.edu/~mfleming



Re: [GENERAL] postgreSQL for storing a database of documents

From
Chris Bitmead
Date:
Dustin Sallings wrote:
>
> On Tue, 4 May 1999, Raphael Finkel wrote:
>
> # Is PostgreSQL appropriate for a database of documents?  I see several
> # potential problems.
> #
> # 1.  Tuples are limited to 8KB.  I assume even the "text" type is limited
> # that way.  But my documents may be longer than that.
>
>         This isn't a problem, I store all of my photographs
> in postgres.

How?

Re: [GENERAL] postgreSQL for storing a database of documents

From
dustin sallings
Date:
On Tue, 4 May 1999, Mark Fleming wrote:

// That's a good solution, but now that XML is starting to take off it
// kind of puts PostgreSQL users at a disadvantage.  We have to keep
// static XML documents on disk in order to use the format, and use
// external search engines to search for content in the documents.  It
// is not that big of a headache, but it is an inconvenience.  I will be
// doing some work on XML soon for my job, and they are going straight
// to Oracle because they want to create "searchable" XML documents on
// the fly to post and distribute on the web.  Very LARGE documents.
// Like the legal code for NYC and such.  When Postgres starts using
// large objects it is going to make a lot of noise.  Especially for
// "start-up" companies.

    What do you mean when Postgres starts using large objects?  It has
for quite some time now.  I chose not to use it because I don't like the
way the data is stored on disk, though.

--
Principal Member Technical Staff, beyond.com    The world is watching America,
pub  1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
|    Key fingerprint =  87 02 57 08 02 D0 DA D6  C8 0F 3E 65 51 98 D8 BE
L______________________________________________ and America is watching TV. __


Re: [GENERAL] postgreSQL for storing a database of documents

From
dustin sallings
Date:
On Wed, 5 May 1999, Chris Bitmead wrote:

// > # 1.  Tuples are limited to 8KB.  I assume even the "text" type is limited
// > # that way.  But my documents may be longer than that.
// >
// >         This isn't a problem, I store all of my photographs
// > in postgres.
//
// How?

    I played with BLOBs, but I didn't like how they were stored, or
that they didn't back up, so I went with Base64 encoded data which I split
up to store.  It's pretty damned quick, and I get a lot more control.
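
    A minimal sketch of that split-and-store idea, not the actual code
behind the photo database: Base64-encode the data, cut the text into
fixed-size pieces that each fit comfortably inside one tuple, and keep a
sequence number with every piece so the rows can be put back in order.  The
chunk size, function names, and (sequence, text) row shape are assumptions
made for the example.

```python
import base64

# Assumed chunk size: small enough that each row stays well under 8KB.
CHUNK_SIZE = 4096


def to_chunks(data, chunk_size=CHUNK_SIZE):
    """Base64-encode bytes and split the text into (sequence, text) rows."""
    encoded = base64.b64encode(data).decode("ascii")
    return [
        (seq, encoded[start:start + chunk_size])
        for seq, start in enumerate(range(0, len(encoded), chunk_size))
    ]


def from_chunks(rows):
    """Reassemble the original bytes from rows in any order."""
    encoded = "".join(text for _, text in sorted(rows))
    return base64.b64decode(encoded)


# Stand-in for a photograph: 25,600 bytes of arbitrary binary data.
photo = bytes(range(256)) * 100
rows = to_chunks(photo)
restored = from_chunks(reversed(rows))
```

    Each (sequence, text) pair would become one row in a chunk table keyed
by the image id and the sequence number.  Base64 makes the stored data
about a third larger than the original, which is the price of keeping it
as plain text that dumps and restores like any other column.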

--
Principal Member Technical Staff, beyond.com    The world is watching America,
pub  1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
|    Key fingerprint =  87 02 57 08 02 D0 DA D6  C8 0F 3E 65 51 98 D8 BE
L______________________________________________ and America is watching TV. __


Re: [GENERAL] postgreSQL for storing a database of documents

From
darold
Date:
Hi,

I'm using Large Objects to store images and display them on the fly. I have never
found any problems, and it is very quiet. "When Postgres starts using large objects
it is going to make a lot of noise." ???

You can also generate XML on the fly, or whatever you want, with Perl or other CGI
tools, simply using all your different fields.

Concerning the backup of BLOBs, I keep a reference for each object and export each
of them from a crontab job, using the reference as the file name. Of course the
OIDs are lost, but you really don't need them! Everything works fine again!

I have written a Perl script for regexp searches over all the text fields stored in
my database, driven from a WWW interface. It's very simple, but it works fine. I do
an AND search with all the given keywords, and if nothing is found I do an OR
search. This is not very powerful, but I can do better the day I find some time :-)
I don't know about searching inside BLOBs.
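
The AND-then-OR fallback can be sketched as follows. This is not the Perl
script itself, only an illustration of the same logic in Python: the
documents are held in a plain dictionary instead of being fetched from the
database, the function name is made up, and keywords are escaped so they
match literally (an assumption; the real script may accept full regular
expressions).

```python
import re


def search(documents, keywords):
    """Return ids matching all keywords, or any keyword if none match all."""
    # Assumption: keywords are literal, case-insensitive strings.
    patterns = [re.compile(re.escape(word), re.IGNORECASE) for word in keywords]
    if not patterns:
        return []

    # First pass: AND search, every keyword must appear in the text.
    matches_all = [
        doc_id
        for doc_id, text in documents.items()
        if all(p.search(text) for p in patterns)
    ]
    if matches_all:
        return matches_all

    # Fallback: OR search, at least one keyword must appear.
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(p.search(text) for p in patterns)
    ]


# Hypothetical sample data standing in for text fields from the database.
docs = {
    1: "PostgreSQL large objects",
    2: "Oracle XML documents",
    3: "PostgreSQL XML export",
}
```

Here search(docs, ["postgresql", "xml"]) finds a document containing both
words, while search(docs, ["oracle", "objects"]) finds none containing both
and falls back to the documents containing either.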

So, I get thousands of hits each month on a P90 with 16 MB of RAM (really not
enough). It's slow, but for 6 months now I haven't found any problems and my
server log file is empty.

I'm making a small modification now, because images stored as Large Objects are
not cached by proxies, so they are very expensive for my bandwidth (64 KB). So I
am switching to storing only the path of each image and saving the images as
normal files.

Gilles.


Re: [GENERAL] postgreSQL for storing a database of documents

From
Chris Bitmead
Date:
darold wrote:

> I'm using Large Objects to store images and display them on the fly. I have never
> found any problems, and it is very quiet. "When Postgres starts using large objects
> it is going to make a lot of noise." ???

How many images are you storing?

--
Chris Bitmead
http://www.bigfoot.com/~chris.bitmead
mailto:chris.bitmead@bigfoot.com

Re: [GENERAL] postgreSQL for storing a database of documents

From
darold
Date:
Hi,

Right now I have fewer than one thousand GIF/JPEG images of around 6 to 15 KB each.
I guess I can go higher :-))

regards,

Gilles Darold

Chris Bitmead wrote:

> darold wrote:
>
> > I'm using Large Objects to store images and display them on the fly. I have never
> > found any problems, and it is very quiet. "When Postgres starts using large objects
> > it is going to make a lot of noise." ???
>
> How many images are you storing?
>
> --
> Chris Bitmead
> http://www.bigfoot.com/~chris.bitmead
> mailto:chris.bitmead@bigfoot.com