Thread: To BLOB Or Not To BLOB
A while ago it was held that the Postgres large object data type was too new and not sufficiently tested and mature to be used in a production environment. I am about to deploy a little database that involves storing large-ish text files (20-500k), which could be done either as large objects or as pointers to those files, the files themselves then being stored as ordinary files in the OS's filesystem. I am undecided on this question. What are the pros and cons? What is the performance in either case vis-a-vis the other? It is a web app; the interface is done in PHP. From which version on (if at all) is the large object interface to be considered stable and ready for production?

cheers frank

--
frank joerdens

joerdens new media
heinrich-roller str. 16/17
10405 berlin
germany

e: frank@joerdens.de
t: +49 30 44055471
f: +49 30 44055475
h: http://www.joerdens.de

pgp public key: http://www.joerdens.de/pgp/frank_joerdens.asc
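[For readers who have not used the interface, here is a minimal sketch of what the large-object write path looks like from PHP, using the pg_lo* functions of PHP's PostgreSQL module; the "documents" table, its columns, and the file path are invented for illustration.]

    <?php
    // Store a text file as a large object and record its oid in an
    // ordinary table (table and column names are hypothetical).
    $conn = pg_connect("dbname=docs");

    pg_exec($conn, "BEGIN");          // lo_* calls must run inside a transaction
    $oid = pg_locreate($conn);        // allocate a new large object, returns its oid
    $lo  = pg_loopen($conn, $oid, "w");
    $data = fread(fopen("/tmp/submission.txt", "r"), filesize("/tmp/submission.txt"));
    pg_lowrite($lo, $data);
    pg_loclose($lo);
    pg_exec($conn, "INSERT INTO documents (title, body_oid)
                    VALUES ('some title', $oid)");
    pg_exec($conn, "COMMIT");
    ?>

Reading is the mirror image: open the stored oid with pg_loopen() in "r" mode inside a transaction and fetch the contents with pg_loread() or pg_loreadall().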
On Sun, 16 Apr 2000, Frank Joerdens wrote:
> A while ago it was held that the Postgres large object data type was too
> new and not sufficiently tested and mature to be used in a production
> environment. I am about to deploy a little database that involves storing
> large-ish text files (20-500k), which could be done either as large objects
> or as pointers to those files, the files themselves then being stored as
> ordinary files in the OS's filesystem. [...] From which version on (if at
> all) is the large object interface to be considered stable and ready for
> production?

How are you going to dump and restore the large objects? The pg_dump and pg_dumpall programs will only handle the tables, it seems. There are some programs out there to dump large objects, and I've been playing with one; it has worked well so far. You can get it at ftp://ftp2.zf.jcu.cz/zakkr/pg/

I'd like to know what other ways there are to handle dumping/restoring large objects. Isn't there any standard way that comes with PostgreSQL?

Nevertheless, I've been using PHP and large objects with no problems so far. I've been able to fully dump the database with all large objects, upgrade PostgreSQL to a new version, run initdb, and restore everything. Reading large objects seems fairly fast too.

On my website, all data is stored in the database, including pictures. The version that is up and running reads the data out of the database on every request! Since I use query-string URLs like ?image=2345, web browsers don't cache it much. It's not so slow if you don't have a lot of traffic, but if you are planning for a lot of traffic, try designing for on-the-fly content generation for a speed-up. That is what I am doing on the devel version of my site: the data is only read out of the database on the first request, and then HTML and image files are generated on the filesystem to serve the same request next time, unless a query string is passed which overrides this and forces content regeneration. It's a little tricky but cool.

I'd say go for storing everything in the database, so long as you master how to dump and restore large objects as well as the databases/tables using pg_dump/pg_dumpall. This way it is possible to really separate your site code from its data: all data is backed up in the database and its dumps, and all code (PHP etc.) is backed up in a CVS repository. HTML pages and images generated on the filesystem as an ordinary-looking website can be seen as just temporary cached data that can be sent out quickly when it matches a request. Look into mod_rewrite on Apache.

If you want a simpler site that serves fast, then just use the file-pointer idea and keep the files outside the database.

Robert Easter
reaster@comptechnews.com
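[A rough sketch of the caching idea described above, again assuming the pg_lo* functions; the cache directory, the "images" table, and its columns are made up for illustration. On later requests the cached file can be served directly by the web server, e.g. via mod_rewrite, without touching the database.]

    <?php
    // Serve an image stored as a large object, writing a cached copy to
    // the filesystem on the first request.
    $id    = (int) $HTTP_GET_VARS["image"];        // e.g. ?image=2345
    $cache = "/var/www/cache/img-$id.jpg";         // hypothetical cache path

    if (!file_exists($cache)) {
        $conn = pg_connect("dbname=site");
        $res  = pg_exec($conn, "SELECT image_oid FROM images WHERE id = $id");
        $oid  = pg_result($res, 0, 0);

        pg_exec($conn, "BEGIN");
        $lo   = pg_loopen($conn, $oid, "r");
        $data = pg_loread($lo, 8 * 1024 * 1024);   // read up to 8 MB
        pg_loclose($lo);
        pg_exec($conn, "COMMIT");

        $fp = fopen($cache, "w");                  // write the cached copy
        fwrite($fp, $data);
        fclose($fp);
    }

    header("Content-Type: image/jpeg");
    readfile($cache);
    ?>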
Hi,

It seems that the issue with large objects is "Why do you want the info in a database?"

It seems to me that the point of a database is its ability to order and relate data. If you want to retrieve the "large-ish text files" based on their content, then I think you need to have the files in the database so they can be searched. However, if you are going to retrieve the files based on something about them that is stored as a separate attribute, such as their title, then I think that storing the files as large objects is unnecessary.

If you have the option to "[do] as pointers to those files", then I think you are not intending to use any of Postgres's features beyond its ability to store and retrieve data. Surely the file system itself is 'better' at this regardless of how stable Postgres LOs are.

I would sure like to hear someone else's opinion on this, because it seems that quite a few folks are using large objects and I am often curious about why.

John Henderson

>A while ago it was held that the Postgres large object data type was too
>new and not sufficiently tested and mature to be used in a production
>environment. [...]
Well, I'm currently using the file system for large files. However, because of that I can see a few reasons why people might want to use Postgresql to handle them. Others can probably mention more.

Using Pg to handle large stuff makes things more consistent overall, and it's easier for you to handle exceptions - e.g. if things fail, the whole thing is rolled back and you theoretically don't get the large files dangling around. Well, ok, you probably do until the next vacuum, but at least you don't have to write your own vacuum to handle that ;). Basically you shift the problem to Pg (and the very fine developers :) ).

The reasons I decided to go file system were:

1) I'm using Linux and ext2fs has a 2GB limit on files, and it seems like 6.5.3 tables are stored as single files, so better not go down that path :).

2) I'm using Perl, DBI etc. and a brief look at BLOB handling put me off. Maybe it was unwarranted, but given 1) I decided to call the whole thing off.

Cheerio,

Link.
> A while ago it was held that the Postgres large object data type was too
> new and not sufficiently tested and mature to be used in a production
> environment. [...] What are the pros and cons? What is the performance in
> either case vis-a-vis the other?

I would always use a reference solution. It's database independent, and if the content you store is a URL, then you may easily get the referenced text file.

Marten
-> It seems that the issue with large objects is "Why do you want the info
-> in a database?"

To organize them, of course.

-> It seems to me that the point of a database is its ability to order and
-> relate data. If you want to retrieve the "large-ish text files" based on
-> their content, then I think you need to have the files in the database so
-> they can be searched. However, if you are going to retrieve the files
-> based on something about them that is stored as a separate attribute,
-> such as their title, then I think that storing the files as large objects
-> is unnecessary.

Partly I would like to retrieve them based on their content, which is why I wrote (badly) a small extension that searches large objects for text strings.

-> If you have the option to "[do] as pointers to those files", then I think
-> you are not intending to use any of Postgres's features beyond its
-> ability to store and retrieve data. Surely the file system itself is
-> 'better' at this regardless of how stable Postgres LOs are.

Yes, but the file system is external to the database. A view that I am beginning to appreciate the more I use databases like PG and Oracle is that all of the relations within a database should be internal. Want to reference a title? Reference the table of titles. Want to reference a bunch of data? Point to the bunch of data. If I have to use the large object interface to do this, then I will.

This way, you can let the database worry about organizing and storing the data, and all you need to do is worry about backing that database up -- not about making sure that the permissions on files are right, or that the directory structure remains the same, or that the DB stays synced to the contents of the files... This is also why the FOREIGN KEY functionality that e.g. MySQL completely lacks, and that PostgreSQL is going to have as of 7.0, is so important: maintaining referential integrity.

Here are two specific applications for which I am using large objects, one in Oracle and one in PG, both involving online databases, both using the ArsDigita Community System.

First, I managed the online paper submission and review process for a conference using Oracle. For this, it was very convenient to be able to grab submissions and stuff them into the database, from which they could be referenced by reviews etc. Moreover, because of the geographical distribution of the review committee, the ability to grab the papers without giving access to the machine by shell or by FTP (which is generally a security problem) was excellent. Finally, separate revisions of the papers could be kept separate using the obvious internal mechanisms of the database.

The second application is one I'm working on right now, a genome database project for the sea urchin. For this, we need to be able to archive arbitrarily large amounts of file-oriented data by (for example) sequence name, and we want to give the option to search it as well. The only really convenient way to do this is to store it all in the database. (Note that it's not terribly *fast* to do it this way ;).

I would be happy to discuss this more if anyone has suggestions for alternatives -- I'm not wedded to the approach, but it seemed to be the obvious one.

cheers,
--titus
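[To make the "keep every reference internal" idea concrete, here is a hedged sketch of what such a schema might look like for the conference-paper example, assuming the 7.0-style FOREIGN KEY support mentioned above; the table and column names are invented, and the large object is referenced by its oid.]

    <?php
    // Hypothetical schema: papers and their revisions live entirely inside
    // the database; each revision row carries the oid of the large object
    // that holds the submitted text.
    $conn = pg_connect("dbname=papers");

    pg_exec($conn, "CREATE TABLE papers (
                        paper_id  serial PRIMARY KEY,
                        title     text NOT NULL
                    )");

    pg_exec($conn, "CREATE TABLE paper_revisions (
                        paper_id  integer REFERENCES papers,  -- enforced as of 7.0
                        revision  integer NOT NULL,
                        body_oid  oid NOT NULL,               -- the large object
                        PRIMARY KEY (paper_id, revision)
                    )");
    ?>

With this layout, removing a paper is just a matter of deleting its rows (and pg_lounlink()ing the stored oids); there is no filesystem bookkeeping to keep in sync.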
How do I start this? There are so many different ways to look at this question. Do I save files inside the database, or do I simply use the database as a tool to query information about a given file, its properties, etc.?

Instead of making general statements, let me pick a specific example. Let's say we want to create an MP3 repository. Our goals are to be able to query the information saved in ID3 tags, and to locate, organize, manipulate, or play the MP3s. To narrow the scope of discussion, let's say I will be accessing these files locally, on my filesystem or an NFS-mounted directory.

What would be the most efficient way for us to query the information found in the ID3 tags? We could parse each file in a number of directories and their subdirectories for the information we require each time we need it, or we could parse it once and save the information in a database structured around ID3 information. I would say the latter is definitely more desirable and acceptable. Inside the database, the ID3 tags can be indexed and our searches optimized for the information we're requesting.

But should we store the MP3s inside the database? I say no. A few observations:

* Filesystems are generally better at allocating space for files than databases are.

* Storing files inside databases makes shell access or direct access to the files impossible. If the database goes south, so does access to the files.

Now, our MP3 decoders are shell programs that read MP3 input by opening a file, a URL connection, or reading it from standard input (that is, if we have a good MP3 player). Yet, in order for these MP3 decoders to be able to play an MP3 found inside the database, we are forced to use a specialized database interface. It may return the MP3 file to the decoder via standard input, or it may export the file to a temp directory and feed the MP3 player the location of that temp file. Why should we go through all of this trouble? And why should we tie the operation of two clearly separate applications into such an inflexible relationship? We add overhead to the database to output this file, and we add another level of logic to get the file to the decoder - a level of logic that could prove the downfall of a would-be MP3 mobile disc jockey, who's worked so hard to please the Wicked Witch of the Wedding. He'd be forced to pull out the CDs and tapes he so dreads -- if he remembered to bring them along.

If we keep file access decoupled from gleaning categorizable, quantifiable data about/from the file, we allow two separate and efficient tools to do their jobs independently. If file access is your concern, then use a file system that will give you the level of control you desire to develop your security policies. If grabbing information from a data file is important, such as parsing text files for content, then build indexes to the data inside the file and store those indexes in the database. Internet directories are excellent examples of this! We don't neuter local file access by the applications that need the data files, and we still provide access to searchable information.

I would argue that access to a data file and querying sets of data files for information are two separate layers of logic. Logical layers are best served when they are given tools to interact, but are not crippled by tight integration. Saving data files inside a database is, IMHO, an example of tight integration, and something to be avoided if possible -- at least in this case.
-- Chad "^chewie, gunnarr" Walstrom <chewie@wookimus.net> http://wookimus.net/chewie
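[One way the "parse once, index in the database" step above might look in PHP, reading only the fixed-layout ID3v1 tag (the last 128 bytes of the file); the "tracks" table, its columns, and the file path are invented for illustration.]

    <?php
    // Extract the ID3v1 tag from one MP3 and index it in the database,
    // storing the tag fields plus a pointer (the path) to the file itself.
    function id3v1($path) {
        $fp = fopen($path, "r");
        fseek($fp, filesize($path) - 128);          // ID3v1 lives in the last 128 bytes
        $tag = fread($fp, 128);
        fclose($fp);
        if (substr($tag, 0, 3) != "TAG") return false;   // no ID3v1 tag present
        return array(
            "title"  => trim(substr($tag, 3, 30)),
            "artist" => trim(substr($tag, 33, 30)),
            "album"  => trim(substr($tag, 63, 30))
        );
    }

    $path = "/music/track01.mp3";                   // hypothetical file
    $info = id3v1($path);
    if ($info) {
        $conn = pg_connect("dbname=mp3");
        pg_exec($conn, sprintf(
            "INSERT INTO tracks (title, artist, album, path)
             VALUES ('%s', '%s', '%s', '%s')",
            addslashes($info["title"]), addslashes($info["artist"]),
            addslashes($info["album"]), addslashes($path)));
    }
    ?>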
G'day.

I do a lot of work with the BASIS textual/multi-media RDBMS package and run into this question all the time. There is one pretty basic answer:

If you leave BLOBs lying around in the file system - particularly if it is a Novell etc. file system - people move them and the links get broken. Just today I had an example where several thousand links broke because some dumb Novell administrator moved a directory holding files which were linked to from a database rather than stored in it. (Not my design decision.)

A database that can hold the BLOBs internally avoids this sort of snafu. (BASIS also has the advantage that it can automagically index almost any form of BLOB to the word/phrase/sentence level if it is stored in the database.)

Cheers,
Stephen Davies

"John Henderson" <jrh@is.com.fj> wrote:
> It seems that the issue with large objects is "Why do you want the info in
> a database?"
> [...]
> If you have the option to "[do] as pointers to those files", then I think
> you are not intending to use any of Postgres's features beyond its ability
> to store and retrieve data. Surely the file system itself is 'better' at
> this regardless of how stable Postgres LOs are.

========================================================================
Stephen Davies Consulting                            scldad@sdc.com.au
Adelaide, South Australia.                          Voice: 08-8177 1595
Computing & Network solutions.                        Fax: 08-8177 0133
On Mon, Apr 17, 2000 at 06:12:37PM +0930, Stephen Davies wrote:
> G'day.
>
> I do a lot of work with the BASIS textual/multi-media RDBMS package and
> run into this question all the time. There is one pretty basic answer:
>
> If you leave BLOBs lying around in the file system - particularly if it is
> a Novell etc. file system - people move them and the links get broken.
> Just today I had an example where several thousand links broke because
> some dumb Novell administrator moved a directory holding files which were
> linked to from a database rather than stored in it. (Not my design
> decision.)

The problem is solving why stuff breaks if it is left on the filesystem. Here are a couple of suggestions for what, in my opinion, should be done when leaving data like this on the filesystem:

1. Have a separate mirrored filesystem for your file data, so that this kind of data is not competing with other data/logs etc. It also makes it more obvious that this data is important.

2. Have a cleanup policy if the filesystem grows a lot, or introduce monitoring levels at which you assign new disk.

3. Scatter the file system with README files saying why the files are important, what is using them, and what cleanup policy is in place. Also say who to ask questions of, like an email list etc.

4. Use relative path names rather than full paths to reference the data; this way, if stuff has to move, it is easier to update.

The problem with program data is that it is sometimes hard to tell what is important and what isn't if you are not close to the application, or as the application gets older, so stuff gets moved and things break.

> A database that can hold the BLOBs internally avoids this sort of snafu.

I think that, depending on what you need, it is very valid to implement it both ways.

Cheers,
Tom.
On Mon, Apr 17, 2000 at 06:12:37PM +0930, Stephen Davies wrote:
> I do a lot of work with the BASIS textual/multi-media RDBMS package and
> run into this question all the time. There is one pretty basic answer:
>
> If you leave BLOBs lying around in the file system - particularly if it is
> a Novell etc. file system - people move them and the links get broken.

I counter that people would not be able to move these files if you had used a solid file-permissions policy. For this type of application, you must not give write/change permissions to ANYONE other than the database operator itself, the uid/gid reserved for the application. Follow a strict policy and you won't have problems. Create enough "helper" programs to recover from broken links, and you're fine:

* CRC each file.
* Use a 'find' program to parse directories and match file names.
* Check the CRC (md5sums would work nicely).
* Have a DB routine run periodically to find and fix broken links.

This would also help you find multiple copies of a file on your system, and perhaps help you clean up extraneous junk...

--
Chad "^chewie, gunnarr" Walstrom <chewie@wookimus.net>
http://wookimus.net/chewie
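[A minimal sketch of the periodic consistency check suggested above, assuming each row stores the file's path and an md5 checksum; the "files" table and its columns are made up for illustration. Rows flagged MISSING would be candidates for the 'find'-based repair step.]

    <?php
    // Compare the checksum stored in the database against the file
    // currently on disk, and report anything missing or changed.
    $conn = pg_connect("dbname=site");
    $res  = pg_exec($conn, "SELECT path, md5 FROM files");

    for ($i = 0; $i < pg_numrows($res); $i++) {
        $row  = pg_fetch_array($res, $i);
        $path = $row["path"];
        if (!file_exists($path)) {
            echo "MISSING: $path\n";
        } else {
            $data = fread(fopen($path, "rb"), filesize($path));
            if (md5($data) != $row["md5"]) {
                echo "CHANGED: $path\n";
            }
        }
    }
    ?>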
> I'm using Linux and ext2fs has a 2GB limit on files, and it seems like
> 6.5.3 tables are stored as single files, so better not go down that path :).

I'm using 6.5.2 and it will split a table among several files, so that's not a problem. Though I'm not sure how BLOBs are stored, since I never use them. I store large text in files, mostly because I create full-text indexes using SWISH++.

--
Adam Ruth
InterCation, Inc.
www.intercation.com

"Lincoln Yeoh" <lylyeoh@mecomb.com> wrote in message
news:3.0.5.32.20000417094047.008e4100@pop.mecomb.po.my...
> Well, I'm currently using the file system for large files. However,
> because of that I can see a few reasons why people might want to use
> Postgresql to handle them. [...]
On Mon, 17 Apr 2000, Adam Ruth wrote:

> > I'm using Linux and ext2fs has a 2GB limit on files, and it seems like
> > 6.5.3 tables are stored as single files, so better not go down that path :).
>
> I'm using 6.5.2 and it will split a table among several files, so that's
> not a problem. Though I'm not sure how BLOBs are stored, since I never use
> them.

BLOBs are currently implemented as a table/index pair, so they will split at the 1Gb mark as do normal tables.

Peter

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf
At 08:29 PM 17-04-2000 +0100, Peter Mount wrote:
>On Mon, 17 Apr 2000, Adam Ruth wrote:
>
>> I'm using 6.5.2 and it will split a table among several files, so that's
>> not a problem. Though I'm not sure how BLOBs are stored, since I never
>> use them.
>
>BLOBs are currently implemented as a table/index pair, so they will split
>at the 1Gb mark as do normal tables.

Oh.. Ah well, maybe with my next version/app then. Where was it mentioned in the docs? Would like to look at the details.

Thanks,

Link.
On Tue, 18 Apr 2000, Lincoln Yeoh wrote:

> At 08:29 PM 17-04-2000 +0100, Peter Mount wrote:
> >BLOBs are currently implemented as a table/index pair, so they will split
> >at the 1Gb mark as do normal tables.
>
> Oh.. Ah well, maybe with my next version/app then. Where was it mentioned
> in the docs? Would like to look at the details.

I'm not sure about the docs, but look at the storage manager portion of the source. It's pretty well documented in there.

Peter

--
Peter T Mount peter@retep.org.uk
Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
Java PDF Generator: http://www.retep.org.uk/pdf