Thread: PostgreSQL for storing a database of documents
Is PostgreSQL appropriate for a database of documents? I see several
potential problems.

1. Tuples are limited to 8KB. I assume even the "text" type is limited
that way. But my documents may be longer than that.

2. I need to be able to search by content. I can use a
regular-expression search in a SELECT ... WHERE clause, but I expect
that to be exceedingly inefficient. I would prefer a GLIMPSE-like
search ability.

3. I will also be conducting searches by some other fields (like author
or call number). But only one field can be a primary key. Will
selection by other fields be terribly inefficient for large numbers
(about 200K) of large tuples (ranging from about 100B to say 10KB)?

If you can help, please respond not only to this mailing list but also
directly to me. Thanks.

Raphael Finkel <raphael@cs.uky.edu>
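For concreteness, the regular-expression search Raphael has in mind
would look something like the sketch below. The table and columns are
hypothetical, not from the thread; "~" is PostgreSQL's POSIX
regular-expression match operator, and no ordinary index can serve it,
which is why a sequential scan over 200K large rows would be slow.

    -- Hypothetical schema, for illustration only.
    CREATE TABLE documents (
        call_number  text PRIMARY KEY,
        author       text,
        body         text    -- subject to the 8KB tuple limit raised above
    );

    -- Content search by regular expression: forces a full sequential
    -- scan, since no index can serve the "~" operator here.
    SELECT call_number, author
    FROM documents
    WHERE body ~ 'operating systems';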
On Tue, 4 May 1999, Raphael Finkel wrote:

# Is PostgreSQL appropriate for a database of documents? I see several
# potential problems.
#
# 1. Tuples are limited to 8KB. I assume even the "text" type is limited
# that way. But my documents may be longer than that.

This isn't a problem; I store all of my photographs in postgres.

# 2. I need to be able to search by content. I can use a
# regular-expression search in a SELECT ... WHERE clause, but I expect
# that to be exceedingly inefficient. I would prefer a GLIMPSE-like
# search ability.

This is a problem, unless you use one of the keyword index schemes
people have been talking about. I'd prefer glimpse or some such in this
situation.

# 3. I will also be conducting searches by some other fields (like author
# or call number). But only one field can be a primary key. Will
# selection by other fields be terribly inefficient for large numbers
# (about 200K) of large tuples (ranging from about 100B to say 10KB)?

The primary key needn't be any of these fields; you can index on
whatever you want. This is where the RDBMS is useful, as you can
instantly get all documents by a given author and/or in a given date
range and/or on a given topic.

--
Dustin Sallings <dustin@spy.net>
SA, beyond.com
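A minimal sketch of the secondary indexes Dustin is describing, reusing
the hypothetical documents table above (the index name is likewise
illustrative):

    -- Any column can carry its own index; it needn't be the primary key.
    CREATE INDEX documents_author_idx ON documents (author);

    -- Lookups on an indexed column can then use an index scan rather
    -- than a sequential scan over all ~200K rows.
    SELECT call_number FROM documents WHERE author = 'Finkel';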
Raphael Finkel <raphael@cs.uky.edu> wrote:

> Is PostgreSQL appropriate for a database of documents? I see several
> potential problems.
>
> 1. Tuples are limited to 8KB. I assume even the "text" type is limited
> that way. But my documents may be longer than that.

This would pretty much nix the idea of using postgres for a document
repository. You'd need to split the data into separate rows in another
table. I've done this for my bulletin board (though most posts should
be less than 8K in my case), but I think it would be even more of a
hassle with documents. What you can do instead is store descriptive
information in the DB and store the actual documents in the file
system, with a field in the document table containing the path to the
file. A few years ago I worked on a commercial document management
program that stored attributes in an Oracle database but kept the
actual documents on disk.

> 2. I need to be able to search by content. I can use a
> regular-expression search in a SELECT ... WHERE clause, but I expect
> that to be exceedingly inefficient. I would prefer a GLIMPSE-like
> search ability.

If the documents are stored outside the database then you can use any
kind of text-searching tool on them. There's also a full-text add-on
for postgres in contrib/fulltextindex which I've been meaning to
install. You could use it to index all of the attributes so people can
do full-text searches on things like comments and descriptions. When
presenting the search results you could merge the results of searching
the files with the results of searching the text fields.

> 3. I will also be conducting searches by some other fields (like author
> or call number). But only one field can be a primary key. Will
> selection by other fields be terribly inefficient for large numbers
> (about 200K) of large tuples (ranging from about 100B to say 10KB)?

You can create as many indexes as you need using CREATE INDEX.

-- Ari Halberstadt mailto:ari@shore.net <http://www.magiccookie.com/>
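A sketch of the metadata-in-the-database, documents-on-disk layout Ari
describes; the table and column names here are hypothetical:

    -- Small, searchable attributes live in the database; the document
    -- body lives on the file system, referenced by path.
    CREATE TABLE doc_meta (
        call_number  text PRIMARY KEY,
        author       text,
        description  text,   -- indexable/searchable attribute
        doc_path     text    -- e.g. '/var/docs/1999/0042.txt'
    );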
Ari Halberstadt wrote:

> This would pretty much nix the idea of using postgres for a document
> repository. You'd need to split the data into separate rows in another
> table. [...] What you can do instead is store descriptive information
> in the DB and store the actual documents in the file system, with a
> field in the document table containing the path to the file. A few
> years ago I worked on a commercial document management program that
> stored attributes in an Oracle database but kept the actual documents
> on disk.

That's a good solution, but now that XML is starting to take off it
kind of puts PostgreSQL users at a disadvantage. We have to keep static
XML documents on disk in order to use the format, and use external
search engines to search for content in the documents. It is not that
big of a headache, but it is an inconvenience. I will be doing some
work on XML soon for my job, and they are going straight to Oracle
because they want to create "searchable" XML documents on the fly to
post and distribute on the web. Very LARGE documents, like the legal
code for NYC and such. When Postgres starts using large objects it is
going to make a lot of noise, especially for "start-up" companies.

--
Mark Fleming <mtony@iname.com>
http://www.globaldrum.com/mfleming
Dustin Sallings wrote:

> On Tue, 4 May 1999, Raphael Finkel wrote:
>
> # 1. Tuples are limited to 8KB. I assume even the "text" type is limited
> # that way. But my documents may be longer than that.
>
> This isn't a problem; I store all of my photographs in postgres.

How?
On Tue, 4 May 1999, Mark Fleming wrote:

// That's a good solution, but now that XML is starting to take off it
// kind of puts PostgreSQL users at a disadvantage. We have to keep
// static XML documents on disk in order to use the format, and use
// external search engines to search for content in the documents. [...]
// When Postgres starts using large objects it is going to make a lot
// of noise, especially for "start-up" companies.

What do you mean, when Postgres starts using large objects? It has had
them for quite some time now. I chose not to use them because I don't
like the way the data is stored on disk, though.

--
Dustin Sallings <dustin@spy.net>
Principal Member Technical Staff, beyond.com
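For reference, the large-object interface Dustin is referring to can be
driven from SQL with the lo_import/lo_export functions. The paths are
read and written by the server process, and the table here is
hypothetical:

    -- A table that references large objects by OID.
    CREATE TABLE images (
        name  text PRIMARY KEY,
        obj   oid      -- OID of the large object holding the image data
    );

    -- lo_import() reads a server-side file into a large object and
    -- returns its OID.
    INSERT INTO images (name, obj)
    VALUES ('photo.jpg', lo_import('/tmp/photo.jpg'));

    -- lo_export() writes it back out to a file, e.g. for a file-based
    -- backup like the crontab scheme Gilles describes below.
    SELECT lo_export(obj, '/tmp/backup/photo.jpg')
    FROM images WHERE name = 'photo.jpg';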
On Wed, 5 May 1999, Chris Bitmead wrote:

// > # 1. Tuples are limited to 8KB. I assume even the "text" type is limited
// > # that way. But my documents may be longer than that.
// >
// > This isn't a problem; I store all of my photographs in postgres.
//
// How?

I played with BLOBs, but I didn't like how they were stored, or that
they didn't get backed up, so I went with Base64-encoded data which I
split up to store. It's pretty damned quick, and I get a lot more
control.

--
Dustin Sallings <dustin@spy.net>
Principal Member Technical Staff, beyond.com
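Dustin doesn't post his schema, but a split-up Base64 scheme along
these lines would match the description (all names here are
hypothetical):

    -- One row per fixed-size chunk of the Base64-encoded image, each
    -- comfortably under the 8KB tuple limit.
    CREATE TABLE photo_chunks (
        photo_id  integer,
        chunk_no  integer,   -- 0, 1, 2, ... within one photo
        data      text,      -- Base64 text, e.g. ~4KB per chunk
        PRIMARY KEY (photo_id, chunk_no)
    );

    -- To reassemble: fetch the chunks in order, concatenate them
    -- client-side, then Base64-decode the result.
    SELECT data FROM photo_chunks
    WHERE photo_id = 42
    ORDER BY chunk_no;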
Hi,

I'm using Large Objects to store images and display them on the fly. I
have never found any problems, and it is very quiet. "When Postgres
starts using large objects it is going to make a lot of noise."???

You can also generate XML on the fly, or whatever you want, with Perl
or other CGI stuff, simply using all your different fields.

Concerning the backup of BLOBs: I keep a reference for each object and
export each of them from a crontab, using the reference as the file
name. Of course the OIDs are lost, but you really don't need them! All
works fine again.

I have written a Perl script for regexp search over all the text fields
stored in my database, driven from a web interface. It's very simple
but it also works fine: I do an AND search with all the given keywords,
and if nothing is found I fall back to an OR search. This is not very
powerful, but I can do better the day I find some time :-)

I don't know about searching inside BLOBs.

So: I get thousands of hits each month on a P90 with 16MB of RAM (really
not enough). It's slow, but in 6 months I haven't found any problems
and my server error log is empty.

I'm making one small change now: images stored as Large Objects are not
cached by proxies, which is very expensive for my bandwidth (a 64K
link). So I will store only the path of each image and save the images
as normal files.

Gilles.
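Gilles doesn't post the script, but the AND-then-OR keyword search he
describes amounts to something like this in SQL (hypothetical table and
keywords; "~*" is PostgreSQL's case-insensitive regex match operator):

    -- First pass: all keywords must match (AND).
    SELECT id, title FROM articles
    WHERE body ~* 'postgres' AND body ~* 'backup';

    -- If that returns no rows, fall back to any keyword (OR).
    SELECT id, title FROM articles
    WHERE body ~* 'postgres' OR body ~* 'backup';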
darold wrote:

> I'm using Large Objects to store images and display them on the fly. I
> have never found any problems, and it is very quiet. "When Postgres
> starts using large objects it is going to make a lot of noise."???

How many images are you storing?

--
Chris Bitmead
http://www.bigfoot.com/~chris.bitmead
mailto:chris.bitmead@bigfoot.com
Hi,

Right now I have fewer than a thousand gif/jpg images, around 6 to 15KB
each. I guess I can go higher :-))

Regards,
Gilles Darold

Chris Bitmead wrote:

> darold wrote:
>
> > I'm using Large Objects to store images and display them on the fly.
> > I have never found any problems, and it is very quiet.
>
> How many images are you storing?