Thread: use BLOBS or use pointers to files on the filesystem?
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi

I'm kind of a database newbie in general, so be nice to me if this is a bonehead question. I am building a Java servlet based web application to manage photos and images. My first thought has been to store the images as BLOBs in the database, but I read one user say that you're better off not using BLOBs. Instead, he says, store your large data objects as files on the filesystem, and just use the database to store the path to the file and any file-related metadata.

I wasn't sure what to think of this, so I was hoping to get some other ideas and opinions. I expect to accumulate 1-2 gigs of new data per year, max, with average image size being 700k.

Also, if I've missed any obvious documentation on this issue, please feel free to send it my way.

Thanks!

Travis

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 6.5.8 for non-commercial use <http://www.pgp.com>

iQA/AwUBPBP9vKBIzo/FCjIeEQLffgCePrTcI3ugWC9NhdRT12Pt1XFnW0gAoLl3
t8HQbiGAOJt6jDhZNZlk+2GL
=N3bL
-----END PGP SIGNATURE-----
On Sun, Dec 09, 2001 at 11:49:29PM +0000, Travis Bear wrote:
> Hi
>
> I'm kind of a database newbie in general, so be nice to me if this
> is a bonehead question. I am building a java servlet based web
> application to manage photos and images. My first thought has been
> to store the images as BLOBs in the database, but I read one user say
> that you're better off not using BLOBs. Instead, he says, store your
> large data objects as files on the filesystem, and just use the
> database to store the path to the file and any file-related meta
> data.

Compared to retrieving the images directly from the database, serving them from the filesystem would be significantly faster. You could use a caching mechanism, though, to make that disadvantage go away (have a look at www.phpbuilder.com for examples of how to do that in PHP, which you might draw upon for your Java solution).

The downside of having the information in the filesystem (your images) and the meta-information about it in the database (the path to the images, etc.) is that you have to work much harder to make your application detect and fix concurrency issues: if any other application besides your servlet can access the filesystem (which is pretty normal), those others might modify it without modifying the database. Then you have a filesystem out of sync with the database, and once you have that problem, it is hard to recover from. (Could anyone explain that better? I can't seem to find the right words to make it really clear.)

I prefer BLOBs + caching.

Regards,

Frank
"Travis Bear" <ec9098@www2.us.postgresql.org> writes:
> Hi
>
> I'm kind of a database newbie in general, so be nice to me if this
> is a bonehead question. I am building a java servlet based web
> application to manage photos and images. My first thought has been
> to store the images as BLOBs in the database, but I read one user
> say that you're better off not using BLOBs. Instead, he says, store
> your large data objects as files on the filesystem, and just use the
> database to store the path to the file and any file-related meta
> data.

Like most programming questions, this involves a series of tradeoffs. It used to be that PostgreSQL had an 8K limit per row (32K if you edited the source and compiled your own). PostgreSQL has a large object interface, but it is even clunkier to use than the filesystem and almost certainly slower.

Thanks to TOAST, the 8K limit is now essentially gone. You can store fairly large files in PostgreSQL quite easily (there still is a hard limit, but it is much, much larger). However, the BYTEA type still needs some work (some characters must be escaped before they can be inserted), and there is significant overhead involved (large fields are compressed and then split across several tuples).

Filesystems are *really* good at sharing files. PostgreSQL, being filesystem based itself, isn't likely to ever really be competitive with a plain filesystem.

> I wasn't sure what to think of this, so I was hoping to get some
> other ideas and opinions. I expect to accumulate 1-2 gigs of new
> data per year, max, with average image size being 700k.

It depends on what you plan to do with these images. For example, I am actually storing images (about that size, too) directly in PostgreSQL. In fact, I have even bypassed the BYTEA type: I am simply base64 encoding the images and inserting them into a big fat text field.
However, I am doing this because these images should get fetched quite rarely, and when one of them does get fetched, it is very likely that it will get fetched again in short order. So I can easily afford to base64 encode/decode.

PostgreSQL's transactions are useful to me as well, in that it is easier to ensure that file changes go as planned. If an error occurs, I simply roll back, and all is well. If something goes wrong while writing to a filesystem-based image, I am left with a corrupted image. Most importantly, however, I don't have to change my existing PostgreSQL backup scripts :).

> Also, if I've missed any obvious documentation on this issue please
> feel free to send it my way.

Take a look at the large object material and the TOAST documentation, as they represent the two alternatives that you have when saving binary (or merely large) fields in PostgreSQL.

> Thanks!
>
>
> Travis

I hope this was helpful,

Jason
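Jason's big-fat-text-field approach can be sketched in a few lines of Python. This is only an illustration, not his actual code: sqlite3 stands in for PostgreSQL so the example is self-contained, and the table and function names are made up.

```python
import base64
import sqlite3

# sqlite3 stands in for PostgreSQL here; table/column names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (name TEXT PRIMARY KEY, data TEXT)")

def store_image(name, raw_bytes):
    # base64 turns arbitrary bytes into plain ASCII, so they fit in an
    # ordinary text field with no escaping worries.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    conn.execute("INSERT INTO images (name, data) VALUES (?, ?)",
                 (name, encoded))

def fetch_image(name):
    row = conn.execute("SELECT data FROM images WHERE name = ?",
                       (name,)).fetchone()
    return base64.b64decode(row[0])

store_image("photo.jpg", b"\x89PNG...raw image bytes...")
assert fetch_image("photo.jpg") == b"\x89PNG...raw image bytes..."
```

The encode/decode cost on every fetch is exactly the tradeoff Jason describes; it only pays off when fetches are rare or cached.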
> > I'm kind of a database newbie in general, so be nice to me if this
> > is a bonehead question. I am building a java servlet based web
> > application to manage photos and images. My first thought has been
> > to store the images as BLOBs in the database, but I read one user say
> > that you're better off not using BLOBs. Instead, he says, store your
> > large data objects as files on the filesystem, and just use the
> > database to store the path to the file and any file-related meta
> > data.

I remember that thread! That was me that basically said, "store your files on the filesystem, and your data in a database."

In my world, we manage upwards of 18.5 million images. We have a small database table with about 5 fields (about 4.5 gig of data) that tells us where the images are (over 2.3 terabytes). So you see, we can keep our database server storage "small", and our image farm storage "big". We can upgrade our database server w/o bringing our images offline. We can upgrade our image farm w/o bringing the database offline.

Another advantage is that storing pointers allows your images to live on a server that isn't yours!

  id  | path
------+---------------------------------------
 2312 | file:///c:/images/you.gif
 2313 | http://www.somesite.com/images/me.gif
 2314 | ftp://www.somesite.com/images/me2.gif

As long as your db application understands the protocols, you can point to images anywhere in the known universe.

> Compared to retrieving stuff directly from the database, this would be

Absolutely it's going to be faster! Postgres, and most other db's for that matter, aren't designed to store files. They're not filesystems, they're databases!

> significantly faster. You could use a caching mechanism though to make
> that disadvantage go away (have a look at www.phpbuilder.com for
> examples on how to do that in PHP - which you might draw upon for your
> Java solution).

Caching is good, esp. if the originals are stored in a far away place. It saves lots of round-trips.
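Eric's pointer table could look like this as a working sketch. All of it is illustrative: sqlite3 is used so the example runs anywhere (his real schema has about 5 fields on a real database server), and `scheme_of` is a made-up helper showing how an application might dispatch on the URL scheme.

```python
import sqlite3
from urllib.parse import urlparse

# A two-column stand-in for Eric's pointer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, path TEXT NOT NULL)")
conn.executemany("INSERT INTO images VALUES (?, ?)", [
    (2312, "file:///c:/images/you.gif"),
    (2313, "http://www.somesite.com/images/me.gif"),
    (2314, "ftp://www.somesite.com/images/me2.gif"),
])

def scheme_of(image_id):
    # The application dispatches on the URL scheme: local file read,
    # HTTP fetch, FTP fetch, and so on.
    (path,) = conn.execute("SELECT path FROM images WHERE id = ?",
                           (image_id,)).fetchone()
    return urlparse(path).scheme

assert scheme_of(2312) == "file"
assert scheme_of(2313) == "http"
```

Because the database stores only a URL, nothing ties the image bytes to the database host; that is the "anywhere in the known universe" property.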
> The downside of having information in the filesystem (your
> images), and meta-information about it in the database (the path to the
> images etc.) is that you have to work much harder to make your application
> detect and fix concurrency issues: If any other application besides your Servlet

The upside is that you have to work harder to make your application detect and fix concurrency issues.

> can access the filesystem (which is pretty normal), those others might
> modify it without modifying the database.

This is a design decision to make up front: "Can (read: should) the filesystem be modified independently of the database?" I say YES.

It's 4:45pm and Travis just received 10 CDs of images from a client that need to be loaded into his database by 6:00pm. Each CD contains 10,000 images. (Fortunately, the CDs contain an index file that closely matches his table schema.) So while Travis is writing a little Perl script to massage the index files into a bunch of INSERT statements, his assistant can begin copying the CDs over to the image server (or directory). And because Travis' storage and data systems are decoupled, he gets the job done on time and the client is happy.

> Then you have a filesystem out of sync with the database. Once you have that problem, it
> would be hard to recover from. (Could anyone explain that better? I can't seem to
> find the right words to make it really clear.)

Or, it could be part of the plan. Sure, storing the images as blobs is going to be easier from a development standpoint. But the easy solution isn't always the best solution.

eric
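The "massage the index files into a bunch of INSERT statements" step Eric describes might look something like this. The index format, column names, and base path are all hypothetical; Eric's version was a Perl script.

```python
import csv
import io

# A made-up index format: filename, title, date taken.
index_file = io.StringIO(
    "you.gif,Portrait of You,2001-12-01\n"
    "me.gif,Portrait of Me,2001-12-02\n"
)

def index_to_inserts(fh, base="file:///images/"):
    stmts = []
    for filename, title, taken in csv.reader(fh):
        # Naive quoting: fine for a trusted index file from a client CD,
        # not for untrusted input.
        stmts.append(
            "INSERT INTO images (path, title, taken) "
            f"VALUES ('{base}{filename}', '{title}', '{taken}');"
        )
    return stmts

stmts = index_to_inserts(index_file)
for s in stmts:
    print(s)
```

The point of the exercise is the decoupling: this script touches only the metadata, while the image bytes are copied to the file server in parallel.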
On Monday 10 December 2001 12:40 pm, Jason Earl wrote:
> simply base64 encoding the images and inserting them into a big fat
> text field. However, I am doing this because these images should get

I kinda like the base64 idea. But if you are concerned about the performance of decoding base64 all the time, and you are using Apache with mod_rewrite, you could devise a scheme where an access to an image file, if the file is not present on disk, redirects to a servlet that retrieves the file from the database, writes it to disk, and then redirects to the file on disk. Further accesses would get the file directly off disk if present, instead of redirecting to the servlet and getting it from the database.

This scheme, though a bit complicated to implement since mod_rewrite needs special regexp-like directives in .htaccess files or your httpd.conf file, can be nice since it can give the performance of having the files on disk and, at the same time, the consistency of having all data in the database. Routines that update an image should delete the image file on disk, so that on the next access to the image, mod_rewrite will redirect to the servlet to get the image fresh from the database and write it out again to disk.

Bob
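Stripped to its logic, Bob's scheme looks roughly like the sketch below. In his version, mod_rewrite does the "file exists?" check and a servlet does the fetch-and-write; here a dict stands in for the database, a temp directory for the web root, and every name is invented.

```python
import os
import tempfile

db = {"logo.gif": b"GIF89a...image bytes..."}  # stand-in for PostgreSQL
cache_dir = tempfile.mkdtemp()                 # stand-in for the web root

def serve_image(name):
    cached = os.path.join(cache_dir, name)
    if not os.path.exists(cached):
        # Cache miss: fetch from the database and write it out, so the
        # web server can serve it straight off disk next time.
        with open(cached, "wb") as f:
            f.write(db[name])
    with open(cached, "rb") as f:
        return f.read()

def invalidate(name):
    # Update routines delete the cached copy; the next access refetches
    # the fresh image from the database.
    cached = os.path.join(cache_dir, name)
    if os.path.exists(cached):
        os.remove(cached)

assert serve_image("logo.gif") == db["logo.gif"]
```

The delete-on-update rule is what keeps the disk cache consistent with the database: without it, a stale file would be served forever.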
"Robert B. Easter" <reaster@comptechnews.com> writes:
> I kinda like the Base64 idea. But if you are concerned about the
> performance of decoding base64 all the time and you are using Apache
> with mod_rewrite, you could devise a scheme where an access to a
> image file, if not present on the disk, would redirect to a servlet
> that would retrieve the file from the database and write it to disk
> and then do a redirect to the file on disk. Further accesses would
> get the file directly off disk if present instead of redirecting to
> the servlet and getting it from the database.

I actually started using BYTEA and manually escaping the two or three characters that must be escaped so that PostgreSQL will play along (I can't remember right now exactly how many characters need to be escaped), but I couldn't get this incantation to work properly with PyGreSQL, and the base64 conversions built into Python are written in C and were faster than my non-working pure-Python kludge.
> This scheme, though a bit complicated to implement since mod_rewrite
> needs special regexp-like directives in .htaccess files or your
> httpd.conf file, can be nice since it can give the performance of
> having the files on disk and at the same time, the consistency of
> having all data in the database.

I am already using Apache with mod_rewrite to front-end Zope; what's a few more rewrite rules amongst friends?

> Routines that update an image should delete the image file on disk
> so that on the next access to the image, mod_rewrite will redirect
> to the servlet to get the image fresh from the database and write it
> out again to disk.

That is a lot more clever than the caching scheme I was planning on implementing (should the need arise). Right now I am content to let Zope cache the images in memory. I was considering timestamping my tuples (including triggers to update the timestamps on update) and then comparing these timestamps with the file creation dates. If the timestamp is newer than the file date, then the file gets updated (and served up).

Of course, that would be harder on the database (a SELECT would always be required, and another SELECT would be required if an updated image were necessary), and the logic is considerably trickier than "if the file doesn't exist, fetch it." Especially since updates are going to be very rare.

Thanks for the ideas,

Jason
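Jason's timestamp-comparison variant can be sketched the same way. This is a guess at the shape of his idea, not his code: a dict of (timestamp, bytes) pairs stands in for the timestamped tuples, and all names are hypothetical.

```python
import os
import tempfile
import time

db = {"photo.jpg": (time.time(), b"original bytes")}  # (timestamp, data)
cache_dir = tempfile.mkdtemp()

def serve(name):
    db_ts, data = db[name]                  # a SELECT is always required
    cached = os.path.join(cache_dir, name)
    if not os.path.exists(cached) or os.path.getmtime(cached) < db_ts:
        # Missing or stale: rewrite the cached file from the database.
        with open(cached, "wb") as f:
            f.write(data)
    with open(cached, "rb") as f:
        return f.read()

assert serve("photo.jpg") == b"original bytes"
db["photo.jpg"] = (time.time() + 60, b"updated bytes")  # simulated UPDATE
assert serve("photo.jpg") == b"updated bytes"
```

Compared with Bob's delete-on-update scheme, this one never needs the update routine to touch the cache, but it pays for that with a database round-trip on every single access.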
I guess you need to go through the benefits you get from a DBMS and decide how they would work with files.

1. If the DBMS machine crashes, how do you "recover"? If you overwrite BLOB files when you do DBMS updates, you are in trouble. If you always add new files, then you need to figure out when it is safe to clean up (VACUUM) the old (unused) files.

2. How do you make consistent backups of your data? Do you back up the DBMS or the files first?

3. If you lose some data (either in a table or in a BLOB file), how do you get the data back (restore from backup) to a consistent state?

4. Is it OK for your application to use two different APIs (SQL and filesystem) to access data?

Don't get me wrong, I have been wrestling with the same decision. Our app needs to read and update a 10 KB blob in under a second. It would probably be much faster using files. But is the speed worth the reliability issues mentioned above?

Brian Beuning

Travis Bear wrote:
> Hi
>
> I'm kind of a database newbie in general, so be nice to me if this
> is a bonehead question. I am building a java servlet based web
> application to manage photos and images. My first thought has been
> to store the images as BLOBs in the database, but I read one user say
> that you're better off not using BLOBs. Instead, he says, store your
> large data objects as files on the filesystem, and just use the
> database to store the path to the file and any file-related meta
> data.
>
> I wasn't sure what to think of this, so I was hoping to get some
> other ideas and opinions. I expect to accumulate 1-2 gigs of new
> data per year, max, with average image size being 700k.
>
> Also, if I've missed any obvious documentation on this issue please
> feel free to send it my way.
>
> Thanks!
>
> Travis
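Brian's first point (never overwrite a BLOB file in place; always write a new file, repoint the row, and clean up unreferenced files later) can be sketched like this. All names are illustrative, and a dict stands in for the database row.

```python
import os
import tempfile

store = tempfile.mkdtemp()  # stand-in for the BLOB file area
table = {}                  # stand-in for the DB row: id -> current filename
version = 0

def update_blob(row_id, data):
    # Always write a *new* file, then repoint the row at it. A crash
    # before the repoint leaves the old version intact and referenced.
    global version
    version += 1
    fname = f"{row_id}.v{version}"
    with open(os.path.join(store, fname), "wb") as f:
        f.write(data)
    table[row_id] = fname

def vacuum():
    # Only files no row references are garbage; deleting them is the
    # filesystem analogue of VACUUM, and it is only safe after the
    # repoint has been committed.
    referenced = set(table.values())
    for fname in os.listdir(store):
        if fname not in referenced:
            os.remove(os.path.join(store, fname))

update_blob(1, b"version one")
update_blob(1, b"version two")   # old file stays on disk until vacuum
vacuum()
assert os.listdir(store) == ["1.v2"]
```

This is the extra machinery you get for free from the DBMS's MVCC and VACUUM when the blobs live inside the database.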