Thread: use BLOBS or use pointers to files on the filesystem?
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi

I'm kind of a database newbie in general, so be nice to me if this is a bonehead question. I am building a Java servlet based web application to manage photos and images. My first thought has been to store the images as BLOBs in the database, but I read one user say that you're better off not using BLOBs. Instead, he says, store your large data objects as files on the filesystem, and just use the database to store the path to the file and any file-related metadata.

I wasn't sure what to think of this, so I was hoping to get some other ideas and opinions. I expect to accumulate 1-2 gigs of new data per year, max, with average image size being 700k.

Also, if I've missed any obvious documentation on this issue, please feel free to send it my way.

Thanks!

Travis

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 6.5.8 for non-commercial use <http://www.pgp.com>

iQA/AwUBPBP9vKBIzo/FCjIeEQLffgCePrTcI3ugWC9NhdRT12Pt1XFnW0gAoLl3
t8HQbiGAOJt6jDhZNZlk+2GL
=N3bL
-----END PGP SIGNATURE-----
On Sun, Dec 09, 2001 at 11:49:29PM +0000, Travis Bear wrote:
> Hi
>
> I'm kind of a database newbie in general, so be nice to me if this
> is a bonehead question. I am building a java servlet based web
> application to manage photos and images. My first thought has been
> to store the images as BLOBs in the database, but I read one user say
> that you're better off not using BLOBs. Instead, he says, store your
> large data objects as files on the filesystem, and just use the
> database to store the path to the file and any file-related meta
> data.

Compared to retrieving the images directly from the database, serving them from the filesystem would be significantly faster. You could use a caching mechanism, though, to make that disadvantage go away (have a look at www.phpbuilder.com for examples of how to do that in PHP, which you might draw upon for your Java solution).

The downside of having the information in the filesystem (your images) and the meta-information about it in the database (the path to the images, etc.) is that you have to work much harder to make your application detect and fix concurrency issues: if any other application besides your servlet can access the filesystem (which is pretty normal), those others might modify it without modifying the database. Then you have a filesystem out of sync with the database, and once you have that problem, it is hard to recover from. (Could anyone explain that better? I can't seem to find the right words to make it really clear.)

I prefer BLOBs + caching.

Regards,

Frank
"Travis Bear" <ec9098@www2.us.postgresql.org> writes:
> Hi
>
> I'm kind of a database newbie in general, so be nice to me if this
> is a bonehead question. I am building a java servlet based web
> application to manage photos and images. My first thought has been
> to store the images as BLOBs in the database, but I read one user
> say that you're better off not using BLOBs. Instead, he says, store
> your large data objects as files on the filesystem, and just use the
> database to store the path to the file and any file-related meta
> data.

Like most programming questions, this involves a series of tradeoffs. It used to be that PostgreSQL had an 8K limit per row (32K if you edited the source and compiled your own). PostgreSQL has a large object interface, but it is even clunkier to use than the filesystem and almost certainly slower.

Thanks to TOAST, the 8K limit is now essentially gone. You can store fairly large files in PostgreSQL quite easily (there still is a hard limit, but it is much, much larger). However, the BYTEA type still needs some work (some characters must be escaped before they can be inserted), and there is significant overhead involved (large fields are compressed and then split across several tuples).

Filesystems are *really* good at sharing files. PostgreSQL, being filesystem based itself, isn't likely to ever really be competitive with a plain filesystem.

> I wasn't sure what to think of this, so I was hoping to get some
> other ideas and opinions. I expect to accumulate 1-2 gigs of new
> data per year, max, with average image size being 700k.

It depends on what you plan to do with these images. For example, I am actually storing images (about that size, too) directly in PostgreSQL. In fact, I have even bypassed the BYTEA type: I am simply base64 encoding the images and inserting them into a big fat text field.
However, I am doing this because these images should get fetched quite rarely, and when one of them does get fetched, it is very likely that it will get fetched again in short order. So I can easily afford to base64 encode/decode.

PostgreSQL's transactions are useful to me as well, in that it is easier to ensure that file changes go as planned. If an error occurs, I simply roll back, and all is well. If something goes wrong while writing to a filesystem-based image, I am left with a corrupted image. Most importantly, however, I don't have to change my existing PostgreSQL backup scripts :).

> Also, if I've missed any obvious documentation on this issue please
> feel free to send it my way.

Take a look at the large object material and the TOAST documentation, as they represent the two alternatives that you have when saving binary (or merely large) fields in PostgreSQL.

> Thanks!
>
>
> Travis

I hope this was helpful,

Jason
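Jason's big-fat-text-field approach can be sketched in a few lines of Python. This is only an illustration, not his actual code: sqlite3 stands in for PostgreSQL so the example is self-contained, and the table and function names are made up.

```python
import base64
import sqlite3

# sqlite3 stands in for PostgreSQL here; table/column names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (name TEXT PRIMARY KEY, data TEXT)")

def store_image(name, raw_bytes):
    # base64 turns arbitrary bytes into plain ASCII, so they fit in an
    # ordinary text field with no escaping worries.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    conn.execute("INSERT INTO images (name, data) VALUES (?, ?)",
                 (name, encoded))

def fetch_image(name):
    row = conn.execute("SELECT data FROM images WHERE name = ?",
                       (name,)).fetchone()
    return base64.b64decode(row[0])

store_image("photo.jpg", b"\x89PNG...raw image bytes...")
assert fetch_image("photo.jpg") == b"\x89PNG...raw image bytes..."
```

The encode/decode cost on every fetch is exactly the tradeoff Jason describes; it only pays off when fetches are rare or cached.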
> > I'm kind of a database newbie in general, so be nice to me if this
> > is a bonehead question. I am building a java servlet based web
> > application to manage photos and images. My first thought has been
> > to store the images as BLOBs in the database, but I read one user say
> > that you're better off not using BLOBs. Instead, he says, store your
> > large data objects as files on the filesystem, and just use the
> > database to store the path to the file and any file-related meta
> > data.

I remember that thread! That was me that basically said, "store your files on the filesystem, and your data in a database."

In my world, we manage upwards of 18.5 million images. We have a small database table with about 5 fields (about 4.5 gig of data) that tells us where the images are (over 2.3 terabytes). So you see, we can keep our database server storage "small", and our image farm storage "big". We can upgrade our database server w/o bringing our images offline. We can upgrade our image farm w/o bringing the database offline.

Another advantage is that storing pointers allows your images to live on a server that isn't yours!

  id  | path
------+---------------------------------------
 2312 | file:///c:/images/you.gif
 2313 | http://www.somesite.com/images/me.gif
 2314 | ftp://www.somesite.com/images/me2.gif

As long as your db application understands the protocols, you can point to images anywhere in the known universe.

> Compared to retrieving stuff directly from the database, this would be

Absolutely it's going to be faster! Postgres, and most other db's for that matter, aren't designed to store files. They're not filesystems, they're databases!

> significantly faster. You could use a caching mechanism though to make
> that disadvantage go away (have a look at www.phpbuilder.com for
> examples on how to do that in PHP - which you might draw upon for your
> Java solution).

Caching is good, esp. if the originals are stored in a far away place. It saves lots of round-trips.
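Eric's pointer table could look like this as a working sketch. All of it is illustrative: sqlite3 is used so the example runs anywhere (his real schema has about 5 fields on a real database server), and `scheme_of` is a made-up helper showing how an application might dispatch on the URL scheme.

```python
import sqlite3
from urllib.parse import urlparse

# A two-column stand-in for Eric's pointer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, path TEXT NOT NULL)")
conn.executemany("INSERT INTO images VALUES (?, ?)", [
    (2312, "file:///c:/images/you.gif"),
    (2313, "http://www.somesite.com/images/me.gif"),
    (2314, "ftp://www.somesite.com/images/me2.gif"),
])

def scheme_of(image_id):
    # The application dispatches on the URL scheme: local file read,
    # HTTP fetch, FTP fetch, and so on.
    (path,) = conn.execute("SELECT path FROM images WHERE id = ?",
                           (image_id,)).fetchone()
    return urlparse(path).scheme

assert scheme_of(2312) == "file"
assert scheme_of(2313) == "http"
```

Because the database stores only a URL, nothing ties the image bytes to the database host; that is the "anywhere in the known universe" property.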
> The downside of having information in the filesystem (your
> images), and meta-information about it in the database (the path to the
> images etc.) is that you have to work much harder to make your application
> detect and fix concurrency issues: If any other application besides your Servlet

The upside is that you have to work harder to make your application detect and fix concurrency issues.

> can access the filesystem (which is pretty normal), those others might
> modify it without modifying the database.

This is a design decision to make up front: "Can (read: should) the filesystem be modified independently of the database?" I say YES.

It's 4:45pm and Travis just received 10 CDs of images from a client that need to be loaded into his database by 6:00pm. Each CD contains 10,000 images. (Fortunately, the CDs contain an index file that closely matches his table schema.) So while Travis is writing a little Perl script to massage the index files into a bunch of INSERT statements, his assistant can begin copying the CDs over to the image server (or directory). And because Travis' storage and data systems are decoupled, he gets the job done on time and the client is happy.

> Then you have a filesystem out of sync with the database. Once you have that problem, it
> would be hard to recover from. (Could anyone explain that better? I can't seem to
> find the right words to make it really clear.)

Or, it could be part of the plan. Sure, storing the images as blobs is going to be easier from a development standpoint. But the easy solution isn't always the best solution.

eric
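The "massage the index files into a bunch of INSERT statements" step Eric describes might look something like this. The index format, column names, and base path are all hypothetical; Eric's version was a Perl script.

```python
import csv
import io

# A made-up index format: filename, title, date taken.
index_file = io.StringIO(
    "you.gif,Portrait of You,2001-12-01\n"
    "me.gif,Portrait of Me,2001-12-02\n"
)

def index_to_inserts(fh, base="file:///images/"):
    stmts = []
    for filename, title, taken in csv.reader(fh):
        # Naive quoting: fine for a trusted index file from a client CD,
        # not for untrusted input.
        stmts.append(
            "INSERT INTO images (path, title, taken) "
            f"VALUES ('{base}{filename}', '{title}', '{taken}');"
        )
    return stmts

stmts = index_to_inserts(index_file)
for s in stmts:
    print(s)
```

The point of the exercise is the decoupling: this script touches only the metadata, while the image bytes are copied to the file server in parallel.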
On Monday 10 December 2001 12:40 pm, Jason Earl wrote:
> simply base64 encoding the images and inserting them into a big fat
> text field. However, I am doing this because these images should get

I kinda like the base64 idea. But if you are concerned about the performance of decoding base64 all the time, and you are using Apache with mod_rewrite, you could devise a scheme where an access to an image file, if the file is not present on disk, redirects to a servlet that retrieves the file from the database, writes it to disk, and then redirects to the file on disk. Further accesses would get the file directly off disk if present, instead of redirecting to the servlet and getting it from the database.

This scheme, though a bit complicated to implement since mod_rewrite needs special regexp-like directives in .htaccess files or your httpd.conf file, can be nice since it can give the performance of having the files on disk and, at the same time, the consistency of having all data in the database. Routines that update an image should delete the image file on disk, so that on the next access to the image, mod_rewrite will redirect to the servlet to get the image fresh from the database and write it out again to disk.

Bob
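Stripped to its logic, Bob's scheme looks roughly like the sketch below. In his version, mod_rewrite does the "file exists?" check and a servlet does the fetch-and-write; here a dict stands in for the database, a temp directory for the web root, and every name is invented.

```python
import os
import tempfile

db = {"logo.gif": b"GIF89a...image bytes..."}  # stand-in for PostgreSQL
cache_dir = tempfile.mkdtemp()                 # stand-in for the web root

def serve_image(name):
    cached = os.path.join(cache_dir, name)
    if not os.path.exists(cached):
        # Cache miss: fetch from the database and write it out, so the
        # web server can serve it straight off disk next time.
        with open(cached, "wb") as f:
            f.write(db[name])
    with open(cached, "rb") as f:
        return f.read()

def invalidate(name):
    # Update routines delete the cached copy; the next access refetches
    # the fresh image from the database.
    cached = os.path.join(cache_dir, name)
    if os.path.exists(cached):
        os.remove(cached)

assert serve_image("logo.gif") == db["logo.gif"]
```

The delete-on-update rule is what keeps the disk cache consistent with the database: without it, a stale file would be served forever.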
"Robert B. Easter" <reaster@comptechnews.com> writes:
> I kinda like the Base64 idea. But if you are concerned about the
> performance of decoding base64 all the time and you are using Apache
> with mod_rewrite, you could devise a scheme where an access to a
> image file, if not present on the disk, would redirect to a servlet
> that would retrieve the file from the database and write it to disk
> and then do a redirect to the file on disk. Further accesses would
> get the file directly off disk if present instead of redirecting to
> the servlet and getting it from the database.

I actually started using BYTEA and manually escaping the two or three characters that must be escaped so that PostgreSQL will play along (I can't remember right now exactly how many characters need to be escaped), but I couldn't get this incantation to work properly with PyGreSQL, and the base64 conversions built into Python are written in C and were faster than my non-working pure-Python kludge.
> This scheme, though a bit complicated to implement since mod_rewrite
> needs special regexp-like directives in .htaccess files or your
> httpd.conf file, can be nice since it can give the performance of
> having the files on disk and at the same time, the consistency of
> having all data in the database.

I am already using Apache with mod_rewrite to front-end Zope; what's a few more rewrite rules amongst friends?

> Routines that update an image should delete the image file on disk
> so that on the next access to the image, mod_rewrite will redirect
> to the servlet to get the image fresh from the database and write it
> out again to disk.

That is a lot more clever than the caching scheme I was planning on implementing (should the need arise). Right now I am content to let Zope cache the images in memory. I was considering timestamping my tuples (including triggers to update the timestamps on update) and then comparing these timestamps with the file creation dates. If the timestamp is newer than the file date, then the file gets updated (and served up).

Of course, that would be harder on the database (a SELECT would always be required, and another SELECT would be required if an updated image were necessary), and the logic is considerably trickier than "if the file doesn't exist, fetch it." Especially since updates are going to be very rare.

Thanks for the ideas,

Jason
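Jason's timestamp-comparison variant can be sketched the same way. This is a guess at the shape of his idea, not his code: a dict of (timestamp, bytes) pairs stands in for the timestamped tuples, and all names are hypothetical.

```python
import os
import tempfile
import time

db = {"photo.jpg": (time.time(), b"original bytes")}  # (timestamp, data)
cache_dir = tempfile.mkdtemp()

def serve(name):
    db_ts, data = db[name]                  # a SELECT is always required
    cached = os.path.join(cache_dir, name)
    if not os.path.exists(cached) or os.path.getmtime(cached) < db_ts:
        # Missing or stale: rewrite the cached file from the database.
        with open(cached, "wb") as f:
            f.write(data)
    with open(cached, "rb") as f:
        return f.read()

assert serve("photo.jpg") == b"original bytes"
db["photo.jpg"] = (time.time() + 60, b"updated bytes")  # simulated UPDATE
assert serve("photo.jpg") == b"updated bytes"
```

Compared with Bob's delete-on-update scheme, this one never needs the update routine to touch the cache, but it pays for that with a database round-trip on every single access.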
I guess you need to go through the benefits you get from a DBMS and decide how they would work with files.

1. If the DBMS machine crashes, how do you "recover"? If you overwrite BLOB files when you do DBMS updates, you are in trouble. If you always add new files, then you need to figure out when it is safe to clean up (VACUUM) the old (unused) files.

2. How do you make consistent backups of your data? Do you back up the DBMS or the files first?

3. If you lose some data (either in a table or in a BLOB file), how do you get the data back (restore from backup) to a consistent state?

4. Is it OK for your application to use two different APIs (SQL and filesystem) to access data?

Don't get me wrong, I have been wrestling with the same decision. Our app needs to read and update a 10 KB blob in under a second. It would probably be much faster using files. But is the speed worth the reliability issues mentioned above?

Brian Beuning

Travis Bear wrote:
> Hi
>
> I'm kind of a database newbie in general, so be nice to me if this
> is a bonehead question. I am building a java servlet based web
> application to manage photos and images. My first thought has been
> to store the images as BLOBs in the database, but I read one user say
> that you're better off not using BLOBs. Instead, he says, store your
> large data objects as files on the filesystem, and just use the
> database to store the path to the file and any file-related meta
> data.
>
> I wasn't sure what to think of this, so I was hoping to get some
> other ideas and opinions. I expect to accumulate 1-2 gigs of new
> data per year, max, with average image size being 700k.
>
> Also, if I've missed any obvious documentation on this issue please
> feel free to send it my way.
>
> Thanks!
>
> Travis
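Brian's first point (never overwrite a BLOB file in place; always write a new file, repoint the row, and clean up unreferenced files later) can be sketched like this. All names are illustrative, and a dict stands in for the database row.

```python
import os
import tempfile

store = tempfile.mkdtemp()  # stand-in for the BLOB file area
table = {}                  # stand-in for the DB row: id -> current filename
version = 0

def update_blob(row_id, data):
    # Always write a *new* file, then repoint the row at it. A crash
    # before the repoint leaves the old version intact and referenced.
    global version
    version += 1
    fname = f"{row_id}.v{version}"
    with open(os.path.join(store, fname), "wb") as f:
        f.write(data)
    table[row_id] = fname

def vacuum():
    # Only files no row references are garbage; deleting them is the
    # filesystem analogue of VACUUM, and it is only safe after the
    # repoint has been committed.
    referenced = set(table.values())
    for fname in os.listdir(store):
        if fname not in referenced:
            os.remove(os.path.join(store, fname))

update_blob(1, b"version one")
update_blob(1, b"version two")   # old file stays on disk until vacuum
vacuum()
assert os.listdir(store) == ["1.v2"]
```

This is the extra machinery you get for free from the DBMS's MVCC and VACUUM when the blobs live inside the database.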