Thread: To BLOB Or Not To BLOB

To BLOB Or Not To BLOB

From
Frank Joerdens
Date:
A while ago the view was that the Postgres large object data type
was too new and not sufficiently tested and mature to be used in a
production environment. I am about to deploy a little database that
involves storing large-ish text files (20-500k), which could be done either
as large objects or as pointers to those files, which would then be stored
as ordinary files in the OS's filesystem. I am undecided on this
question. What are the pros and cons? What is the performance of either
approach vis-a-vis the other? It is a web app; the interface is done in PHP.
From which version on (if at all) is the large object interface
to be considered stable and ready for production?

cheers frank

--
frank joerdens

joerdens new media
heinrich-roller str. 16/17
10405 berlin
germany

e: frank@joerdens.de
t: +49 30 44055471
f: +49 30 44055475
h: http://www.joerdens.de

pgp public key: http://www.joerdens.de/pgp/frank_joerdens.asc

Re: To BLOB Or Not To BLOB

From
"Robert B. Easter"
Date:
On Sun, 16 Apr 2000, Frank Joerdens wrote:
> A while ago the view was that the Postgres large object data type
> was too new and not sufficiently tested and mature to be used in a
> production environment. I am about to deploy a little database that
> involves storing large-ish text files (20-500k), which could be done either
> as large objects or as pointers to those files, which would then be stored
> as ordinary files in the OS's filesystem. I am undecided on this
> question. What are the pros and cons? What is the performance of either
> approach vis-a-vis the other? It is a web app; the interface is done in PHP.
> From which version on (if at all) is the large object interface
> to be considered stable and ready for production?
>

How are you going to dump and restore the large objects?  The pg_dump and
pg_dumpall programs only seem to handle the tables.

There are some programs out there to dump large objects and I've been playing
with one.  It's worked well so far.  You can get it at
ftp://ftp2.zf.jcu.cz/zakkr/pg/

I'd like to know what other ways there are to handle dumping/restoring large
objects.  Isn't there any standard way that comes with PostgreSQL?

Nevertheless, I've been using PHP and large objects with no problems so far.
I've been able to fully dump the database with all large objects, upgrade
PostgreSQL to a new version, initdb, and restore everything.
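
For illustration, a rough libpq sketch of the kind of thing such a dump
program does: walk a table that references large objects and lo_export
each one to a file.  The table name and layout ("documents(id, content oid)")
are made up for the example; this is a sketch of the idea, not the tool
from the FTP site above.

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn   *conn = PQconnectdb("dbname=mydb");   /* assumed DSN */
    PGresult *res;
    int       i;

    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* large object descriptors are transaction-scoped, so wrap the run */
    PQclear(PQexec(conn, "BEGIN"));

    res = PQexec(conn, "SELECT id, content FROM documents");
    for (i = 0; i < PQntuples(res); i++) {
        char fname[64];
        Oid  lobj = (Oid) atol(PQgetvalue(res, i, 1));

        snprintf(fname, sizeof(fname), "doc_%s.dump", PQgetvalue(res, i, 0));
        if (lo_export(conn, lobj, fname) < 0)
            fprintf(stderr, "export of object %u failed\n", lobj);
    }
    PQclear(res);

    PQclear(PQexec(conn, "COMMIT"));
    PQfinish(conn);
    return 0;
}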

Reading large objects seems fairly fast too.  On my website, all data is
stored in the database, including pictures.  The version that is up and running
reads the data out of the database on every request!  Since I use query
string URLs like ?image=2345, web browsers don't cache it much.  It's not so slow
if you don't have a lot of traffic, but if you are planning for a lot of
traffic, try designing for on-the-fly content generation for a speed-up.  That
is what I am doing on the devel version of my site.  This way the data is only
read out of the database on the first request; then HTML and image files are
generated on the filesystem to serve the same request next time, unless a query
string is passed which overrides and forces content regeneration.  It's a little
tricky but cool.

I'd say go for storing everything in the database so long as you master how to
dump and restore large objects as well as the databases/tables using
pg_dump/pg_dumpall.  This way, it is possible to really separate your site code
from its data: all data backed up in the database and its dumps, all code (PHP
etc.) backed up in a CVS repository.  HTML pages and images generated on the
filesystem as an ordinary-looking website can be seen as just temporary cached
data that can be sent out quickly when it matches a request.  Look into
mod_rewrite on Apache.  If you want a simpler site that serves fast,
then just use the file pointer idea and keep the files outside the database.

Robert Easter
reaster@comptechnews.com

Re: To BLOB Or Not To BLOB

From
"John Henderson"
Date:
Hi,
It seems that the issue with large objects is "Why do you want the info in a
database?"

It seems to me that the point of a database is its ability to order and
relate data. If you want to retrieve the "large-ish text files" based on
their content then I think you need to have the files in the database so
they can be searched. However, if you are going to retrieve the files based
on something about them that is stored as a separate attribute, such as
their title, then I think that storing the files as large objects is
unnecessary.

If you have the option to "[do] as pointers to those files" then I think you
are not intending to use any of Postgres's features beyond its ability to
store and retrieve data. Surely the file system itself is 'better' at this,
regardless of how stable Postgres LOs are.

I would sure like to hear someone else's opinion on this because it seems
that quite a few folks are using large objects and I am often curious about
why.

John Henderson

>A while ago the view was that the Postgres large object data type
>was too new and not sufficiently tested and mature to be used in a
>production environment. I am about to deploy a little database that
>involves storing large-ish text files (20-500k), which could be done either
>as large objects or as pointers to those files, which would then be stored
>as ordinary files in the OS's filesystem. I am undecided on this
>question. What are the pros and cons? What is the performance of either
>approach vis-a-vis the other? It is a web app; the interface is done in PHP.
>From which version on (if at all) is the large object interface
>to be considered stable and ready for production?
>
>cheers frank
>
>--
>frank joerdens
>
>joerdens new media
>heinrich-roller str. 16/17
>10405 berlin
>germany
>
>e: frank@joerdens.de
>t: +49 30 44055471
>f: +49 30 44055475
>h: http://www.joerdens.de
>
>pgp public key: http://www.joerdens.de/pgp/frank_joerdens.asc
>


Re: To BLOB Or Not To BLOB

From
Lincoln Yeoh
Date:
Well I'm currently using the file system for large files. However because
of that I can see a few reasons why people might want to use Postgresql to
handle them. Others can probably mention more.

Using Pg to handle large stuff makes things more consistent overall, and it's
easier for you to handle exceptions - e.g. if things fail, the whole thing
is rolled back and you theoretically don't get the large files dangling
around. Well, OK, you probably do until the next vacuum, but at least you
don't have to write your own vacuum to handle that ;). Basically you shift
the problem to Pg (and the very fine developers :) ).
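
To make the rollback point concrete, a minimal C/libpq sketch (the
files table and its columns are invented for the example, and whether a
rollback also reclaims the object's storage depends on the Pg version):

#include <stdio.h>
#include <libpq-fe.h>

/* Import a file as a large object and register it in one transaction,
 * so a failure leaves no half-registered object behind. */
int store_file(PGconn *conn, const char *path)
{
    char      sql[256];
    Oid       lobj;
    PGresult *res;

    PQclear(PQexec(conn, "BEGIN"));

    lobj = lo_import(conn, path);          /* InvalidOid on error */
    if (lobj == InvalidOid) {
        PQclear(PQexec(conn, "ROLLBACK"));
        return -1;
    }

    snprintf(sql, sizeof(sql),
             "INSERT INTO files (path, content) VALUES ('%s', %u)",
             path, lobj);                   /* real code must escape path */
    res = PQexec(conn, sql);
    if (PQresultStatus(res) != PGRES_COMMAND_OK) {
        PQclear(res);
        PQclear(PQexec(conn, "ROLLBACK")); /* metadata and object both gone */
        return -1;
    }
    PQclear(res);

    PQclear(PQexec(conn, "COMMIT"));
    return 0;
}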

The reasons I decided to go file system were:
1) I'm using Linux and ext2fs has a 2GB limit on files, and it seems like
6.5.3 tables are stored as single files, so better not go down that path :).
2) I'm using Perl, DBI etc and a brief look at BLOB handling put me off.
Maybe it was unwarranted, but given 1) I decided to call the whole thing off.

Cheerio,

Link.



Re: To BLOB Or Not To BLOB

From
Marten Feldtmann
Date:
> A while ago the view was that the Postgres large object data type
> was too new and not sufficiently tested and mature to be used in a
> production environment. I am about to deploy a little database that
> involves storing large-ish text files (20-500k), which could be done either
> as large objects or as pointers to those files, which would then be stored
> as ordinary files in the OS's filesystem. I am undecided on this
> question. What are the pros and cons? What is the performance of either
> approach vis-a-vis the other? It is a web app; the interface is done in PHP.
> From which version on (if at all) is the large object interface
> to be considered stable and ready for production?
>

 I would always use a reference solution. It's database-independent,
and if the content you store is a URL, then you can easily fetch the
referenced text file.

 Marten



Re: To BLOB Or Not To BLOB

From
Titus Brown
Date:
-> It seems that the issue with large objects is "Why do you want the info in a
-> database?"

To organize them, of course.

-> It seems to me that the point of a database is its ability to order and
-> relate data. If you want to retrieve the "large-ish text files" based on
-> their content then I think you need to have the files in the database so
-> they can be searched. However, if you are going to retrieve the files based
-> on something about them that is stored as a separate attribute, such as
-> their title, then I think that storing the files as large objects is
-> unnecessary.

Partly I would like to retrieve them based on their content, which is
why I wrote (badly) a small extension that searches large objects for
text strings.
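
Client-side, the gist of such a search can be sketched in C with libpq
(my extension actually runs inside the server; this is just the idea,
and the chunked scan has the caveats noted in the comments):

#include <string.h>
#include <libpq-fe.h>
#include <libpq/libpq-fs.h>     /* INV_READ */

#define CHUNK 8192

/* Return 1 if needle occurs in the large object, 0 if not, -1 on error.
 * Naive: a match spanning a chunk boundary is missed, and an embedded
 * NUL byte truncates the scan of that chunk. */
int lo_contains(PGconn *conn, Oid lobj, const char *needle)
{
    char buf[CHUNK + 1];
    int  fd, n, found = 0;

    PQclear(PQexec(conn, "BEGIN"));   /* LO descriptors live inside a txn */
    fd = lo_open(conn, lobj, INV_READ);
    if (fd < 0) {
        PQclear(PQexec(conn, "ROLLBACK"));
        return -1;
    }
    while (!found && (n = lo_read(conn, fd, buf, CHUNK)) > 0) {
        buf[n] = '\0';
        if (strstr(buf, needle) != NULL)
            found = 1;
    }
    lo_close(conn, fd);
    PQclear(PQexec(conn, "COMMIT"));
    return found;
}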

-> If you have the option to "[do] as pointers to those files" then I think you
-> are not intending to use any of Postgres's features beyond its ability to
-> store and retrieve data. Surely the file system itself is 'better' at this,
-> regardless of how stable Postgres LOs are.

Yes, but the file system is external to the database.

A view that I am beginning to appreciate the more I use databases
like PG and Oracle is that all of the relations within a database should
be internal.  Want to reference a title?  Reference the table of titles.
Want to reference a bunch of data?  Point to the bunch of data.  If I
have to use the large object interface to do this, then I will.

This way, you can let the database worry about organizing and storing the
data, and all you need to do is worry about backing that database up --
not about making sure that the permissions on files are right, or that
the directory structure remains the same, or the DB remains synced to
the contents of the files...

This is also why the FOREIGN KEY functionality that e.g. MySQL completely
lacks, and that PostgreSQL is going to have as of 7.0, is so important:
maintaining referential integrity.

Here are two specific applications for which I am using large objects,
one in Oracle and one in PG, both involving online databases, both
using the ArsDigita Community System.

First of all, I managed the online paper submission and review process
for a conference using Oracle.  For this, it was very convenient to
be able to grab submissions and stuff them into the database, from which
they could be referenced by review etc.  Moreover, because of the
geographical distribution of the review committee, the ability to
grab the papers without giving access to the machine by shell or by
FTP (which is generally a security problem) was excellent.  Finally,
separate revisions of the papers could be kept separate using the
obvious internal mechanisms of the database.

The second application is one I'm working on right now, a genome database
project for the sea urchin.  For this, we need to be able to archive
arbitrarily large amounts of file-oriented data by (for example) sequence
name, and we want to give the option to search it as well.  The
only really convenient way to do this is to store it all in the
database.  (Note that it's not terribly *fast* to do it this way ;).

I would be happy to discuss this more if anyone has suggestions for
alternatives  -- I'm not wedded to the approach, but it seemed to be
the obvious one.

cheers,
--titus

Re: To BLOB Or Not To BLOB

From
chewie@wookimus.net
Date:
How do I start this?  There are so many different ways to look at this
question.  Do I save files inside the database, or do I simply use the database
as a tool to query information about a given file, its properties, etc.?

Instead of making general statements, let me pick specific examples.  Let's say
we want to create an MP3 repository.  Our goals are to be able to query the
information saved in ID3 tags, and to locate, organize, manipulate, or play the
MP3.  To narrow the scope of discussion, let's say I will be accessing these
files locally, on my filesystem or an NFS-mounted directory.

What would be the most efficient way for us to query the information found in
the ID3 tags?  We could parse each file in a number of directories and their
subdirectories for the information we require each time we need it, or we could
parse it once and save the information in a database structured around ID3
information.  I would say the latter is definitely more desirable and
acceptable.  Inside the database, the ID3 tags can be indexed and our searches
optimized by the information we're requesting.
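
To make the "parse once" step concrete, here is a small C sketch that
pulls the fixed-size ID3v1 tag out of the last 128 bytes of an MP3; the
fields it fills in are exactly what would go into the database.  (ID3v2
is a different, variable-length format at the start of the file and is
not handled here.)

#include <stdio.h>
#include <string.h>

struct id3v1 {
    char tag[3];            /* "TAG" if a v1 tag is present */
    char title[30];
    char artist[30];
    char album[30];
    char year[4];
    char comment[30];       /* last byte is the track no. in ID3v1.1 */
    unsigned char genre;
};                          /* exactly 128 bytes */

/* Fill *out from the file's trailing tag; return 0 on success, -1 if
 * the file is unreadable or carries no ID3v1 tag.  Note the fields are
 * space-padded, not NUL-terminated. */
int read_id3v1(const char *path, struct id3v1 *out)
{
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;
    if (fseek(f, -128L, SEEK_END) != 0 ||
        fread(out, 1, 128, f) != 128 ||
        memcmp(out->tag, "TAG", 3) != 0) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}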

But should we store the MP3s inside the database?  I say no.  A few
observations:

    * Filesystems are generally better at allocating space for files than
        databases are.
    * Storing files inside a database renders shell access or direct access
        to the files impossible.  If the database goes south, so does access
        to the files.

Now, our mp3 decoders are shell programs that read mp3 input by opening a file,
a URL connection, or reading it from standard input (that is if we have a good
mp3 player).  Yet, in order for these mp3 decoders to be able to play an mp3
found inside the database, we are forced to use a specialized database
interface.  It may return the mp3 file to the decoder via standard input,
or it may export the file to a temp directory and feed the mp3 player the
location of that temp file.

Yet why should we go through all of this trouble?  And why should we tie the
operation of two clearly separate applications into such an inflexible
relationship?  We add overhead to the database to output this file, and we add
another level of logic to get the file to the decoder -- a level of logic that
could prove the downfall of a would-be mobile MP3 disc jockey who's worked so
hard to please the Wicked Witch of the Wedding.  He'd be forced to pull out the
CDs and tapes he so dreads -- if he remembered to bring them along.

If we keep file access decoupled from gleaning categorizable, quantifiable data
about/from the file, we allow two separate and efficient tools to do their jobs
independently.  If file access is your concern, then use a file system that
will give you the level of control you desire to develop your security
policies.

If grabbing information from a data file is important, such as parsing text
files for content, then build indexes to the data inside the file and store
those indexes in the database.  Internet directories are excellent examples of
this!  We don't neuter local file access by the applications that need the data
files, and we still provide access to searchable information.

I would argue that access to a data file and querying sets of data files for
information are two separate layers of logic.  Logical layers are best served
when they are given tools to interact, but are not crippled by tight
integration.  Saving data files inside a database, IMHO, is an example of tight
integration, and something to be avoided if possible -- at least in this case.

--
  Chad "^chewie, gunnarr" Walstrom <chewie@wookimus.net>
              http://wookimus.net/chewie

Re: To BLOB Or Not To BLOB

From
Stephen Davies
Date:
G'day.

I do a lot of work with the BASIS textual/multi-media RDBMS package and
run into this question all the time.

There is one pretty basic answer:

If you leave BLOBS lying around in the file system - particularly if
it is a Novell etc file system - people move them and the links get
broken.

Just today I had an example where several thousand links broke because
some dumb Novell administrator moved a directory holding files which
were linked to from a database rather than stored in it. (Not my design
decision.)

A database that can hold the BLOBS internally avoids this sort of snafu.

(BASIS also has the advantage that it can automagically index almost
any form of BLOB to the word/phrase/sentence level if it is stored in
the database.)

Cheers,
Stephen Davies
"John Henderson" <jrh@is.com.fj>  wrote:
> Hi,
> It seems that the issue with large objects is "Why do you want the info in a
> database?"
>
> It seems to me that the point of a database is its ability to order and
> relate data. If you want to retrieve the "large-ish text files" based on
> their content then I think you need to have the files in the database so
> they can be searched. However, if you are going to retrieve the files based
> on something about them that is stored as a separate attribute, such as
> their title, then I think that storing the files as large objects is
> unnecessary.
>
> If you have the option to "[do] as pointers to those files" then I think you
> are not intending to use any of Postgres's features beyond its ability to
> store and retrieve data. Surely the file system itself is 'better' at this,
> regardless of how stable Postgres LOs are.
>
> I would sure like to hear someone else's opinion on this because it seems
> that quite a few folks are using large objects and I am often curious about
> why.
>
> John Henderson
>
> >A while ago the view was that the Postgres large object data type
> >was too new and not sufficiently tested and mature to be used in a
> >production environment. I am about to deploy a little database that
> >involves storing large-ish text files (20-500k), which could be done either
> >as large objects or as pointers to those files, which would then be stored
> >as ordinary files in the OS's filesystem. I am undecided on this
> >question. What are the pros and cons? What is the performance of either
> >approach vis-a-vis the other? It is a web app; the interface is done in PHP.
> >From which version on (if at all) is the large object interface
> >to be considered stable and ready for production?
> >
> >cheers frank
> >
> >--
> >frank joerdens
> >
> >joerdens new media
> >heinrich-roller str. 16/17
> >10405 berlin
> >germany
> >
> >e: frank@joerdens.de
> >t: +49 30 44055471
> >f: +49 30 44055475
> >h: http://www.joerdens.de
> >
> >pgp public key: http://www.joerdens.de/pgp/frank_joerdens.asc
> >




========================================================================
Stephen Davies Consulting                      scldad@sdc.com.au
Adelaide, South Australia.                       Voice: 08-8177 1595
Computing & Network solutions.               Fax: 08-8177 0133



Re: To BLOB Or Not To BLOB

From
"Thomas O'Dowd"
Date:
On Mon, Apr 17, 2000 at 06:12:37PM +0930, Stephen Davies wrote:
> G'day.
>
> I do a lot of work with the BASIS textual/multi-media RDBMS package and
> run into this question all the time.
>
> There is one pretty basic answer:
>
> If you leave BLOBS lying around in the file system - particularly if
> it is a Novell etc file system - people move them and the links get
> broken.

> Just today I had an example where several thousand links broke because
> some dumb Novell administrator moved a directory holding files which
> were linked to from a database rather than stored in it. (Not my design
> decision.)

The problem is working out why stuff breaks when it is left on the
filesystem. Here are a couple of suggestions for what, in my opinion,
should be done when leaving data like this on the filesystem.

1. Have a separate mirrored filesystem for your file data so that this
kind of data is not competing with other data/logs etc. It also makes
it more obvious that this data is important.
2. Have a cleanup policy if the filesystem grows a lot, or introduce
monitoring levels at which you assign new disk.
3. Scatter the file system with README files saying why the files
are important, what is using them, and what cleanup policy is in place.
Also say who to direct questions to, such as an email list.
4. Use relative path names rather than full ones to reference the data;
that way, if stuff has to move, it is easier to update (see the sketch
below).
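
For point 4, a tiny C sketch of what I mean; FILE_ROOT is an invented
configuration setting standing in for wherever the tree actually lives,
so a move means changing one definition rather than every stored row:

#include <stdio.h>

#define FILE_ROOT "/var/data/blobs"   /* hypothetical; read from config */

/* Join the configured root with a relative path stored in the database.
 * Returns 0, or -1 if the result would not fit in dst. */
int resolve_path(char *dst, size_t dstlen, const char *relpath)
{
    int n = snprintf(dst, dstlen, "%s/%s", FILE_ROOT, relpath);

    return (n < 0 || (size_t) n >= dstlen) ? -1 : 0;
}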

The problem with program data is that sometimes it is hard to tell
what is important and what isn't if you are not close to the application,
or as the application gets older, so stuff gets moved and things break.

> A database that can hold the BLOBS internally avoids this sort of snafu.

I think that, depending on what you need, either way is a perfectly
valid implementation.

Cheers,

Tom.

Broken Links...Re: To BLOB Or Not To BLOB

From
chewie@wookimus.net
Date:
On Mon, Apr 17, 2000 at 06:12:37PM +0930, Stephen Davies wrote:
> G'day.
>
> I do a lot of work with the BASIS textual/multi-media RDBMS package and run
> into this question all the time.
>
> There is one pretty basic answer:
>
> If you leave BLOBS lying around in the file system - particularly if it is a
> Novell etc file system - people move them and the links get broken.

I counter that people would not be able to move these files if you had used a
solid file permissions policy.  For this type of application, you must not give
write/change permissions to ANYONE other than the database operator itself, the
uid/gid reserved for the application.  Follow a strict policy and you won't
have problems.  Create enough "helper" programs to recover from broken links,
and you're fine.

    * CRC each file.
    * Use a 'find' program to parse directories and match file names.
    * Check the CRC (md5sums would work nicely; see the sketch below).
    * Have a DB routine run periodically to find and fix broken links.

This would also help you find multiple copies of a file on your system, and
perhaps help you clean up extraneous junk...
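
The checksum step might look like this in C with zlib (the directory
walk and the DB routine that fixes the links are left out):

#include <stdio.h>
#include <zlib.h>

/* Compute a CRC-32 for a file, to be compared against the value stored
 * alongside the row that points at it.  Returns 0 and sets *out, or -1
 * if the file cannot be read. */
int file_crc32(const char *path, unsigned long *out)
{
    unsigned char buf[8192];
    size_t        n;
    unsigned long crc = crc32(0L, Z_NULL, 0);
    FILE         *f = fopen(path, "rb");

    if (f == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        crc = crc32(crc, buf, (uInt) n);
    fclose(f);
    *out = crc;
    return 0;
}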

--
  Chad "^chewie, gunnarr" Walstrom <chewie@wookimus.net>
              http://wookimus.net/chewie


Re: To BLOB Or Not To BLOB

From
"Adam Ruth"
Date:
> I'm using Linux and ext2fs has a 2GB limit on files, and it seems like
> 6.5.3 tables are stored as single files, so better not go down that path :).

I'm using 6.5.2 and it will split a table among several files, so that's not
a problem.  Though I'm not sure how BLOBs are stored since I never use them.

I store large text in files mostly because I create full-text indexes using
SWISH++.

--
Adam Ruth
InterCation, Inc.
www.intercation.com
"Lincoln Yeoh" <lylyeoh@mecomb.com> wrote in message
news:3.0.5.32.20000417094047.008e4100@pop.mecomb.po.my...
> Well I'm currently using the file system for large files. However because
> of that I can see a few reasons why people might want to use Postgresql to
> handle them. Others can probably mention more.
>
> Using Pg to handle large stuff makes things more consistent overall, and it's
> easier for you to handle exceptions - e.g. if things fail, the whole thing
> is rolled back and you theoretically don't get the large files dangling
> around. Well, OK, you probably do until the next vacuum, but at least you
> don't have to write your own vacuum to handle that ;). Basically you shift
> the problem to Pg (and the very fine developers :) ).
>
> The reasons I decided to go file system were:
> 1) I'm using Linux and ext2fs has a 2GB limit on files, and it seems like
> 6.5.3 tables are stored as single files, so better not go down that path :).
> 2) I'm using Perl, DBI etc and a brief look at BLOB handling put me off.
> Maybe it was unwarranted, but given 1) I decided to call the whole thing off.
>
> Cheerio,
>
> Link.
>
>



Re: To BLOB Or Not To BLOB

From
Peter Mount
Date:
On Mon, 17 Apr 2000, Adam Ruth wrote:

> > I'm using Linux and ext2fs has a 2GB limit on files, and it seems like
> > 6.5.3 tables are stored as single files, so better not go down that path :).
>
> I'm using 6.5.2 and it will split a table among several files, so that's not
> a problem.  Though I'm not sure how BLOBs are stored since I never use them.

BLOBS are currently implemented as a table/index pair, so they will split
at the 1Gb mark as do normal tables.

Peter

--
       Peter T Mount peter@retep.org.uk
      Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
 Java PDF Generator: http://www.retep.org.uk/pdf


Re: To BLOB Or Not To BLOB

From
Lincoln Yeoh
Date:
At 08:29 PM 17-04-2000 +0100, Peter Mount wrote:
>On Mon, 17 Apr 2000, Adam Ruth wrote:
>
>> > I'm using Linux and ext2fs has a 2GB limit on files, and it seems like
>> > 6.5.3 tables are stored as single files, so better not go down that path :).
>>
>> I'm using 6.5.2 and it will split a table among several files, so that's not
>> a problem.  Though I'm not sure how BLOBs are stored since I never use them.
>
>BLOBS are currently implemented as a table/index pair, so they will split
>at the 1Gb mark as do normal tables.

Oh.. Ah well, maybe with my next version/app then. Where is it mentioned
in the docs? I'd like to look at the details.

Thanks,

Link.



Re: To BLOB Or Not To BLOB

From
Peter Mount
Date:
On Tue, 18 Apr 2000, Lincoln Yeoh wrote:

> At 08:29 PM 17-04-2000 +0100, Peter Mount wrote:
> >On Mon, 17 Apr 2000, Adam Ruth wrote:
> >
> >> > I'm using Linux and ext2fs has a 2GB limit on files, and it seems like
> >> > 6.5.3 tables are stored as single files, so better not go down that path :).
> >>
> >> I'm using 6.5.2 and it will split a table among several files, so that's not
> >> a problem.  Though I'm not sure how BLOBs are stored since I never use them.
> >
> >BLOBS are currently implemented as a table/index pair, so they will split
> >at the 1Gb mark as do normal tables.
>
> Oh.. Ah well, maybe with my next version/app then. Where is it mentioned
> in the docs? I'd like to look at the details.

I'm not sure about the docs, but look at the storage manager portion of
the source. It's pretty well documented in there.

Peter

--
       Peter T Mount peter@retep.org.uk
      Main Homepage: http://www.retep.org.uk
PostgreSQL JDBC Faq: http://www.retep.org.uk/postgres
 Java PDF Generator: http://www.retep.org.uk/pdf