Thread: Database versus filesystem for storing images

Database versus filesystem for storing images

From
Clodoaldo
Date:
A web application will store hundreds of thousands images.

The images meta data will be stored in postgresql but I'm in doubt
about where to store the images themselves, if in the database as byte
strings or in the file system.

Would one of them have a faster image search than the other?

In the database the images would be indexed by a serial id, that of
the file submitter, and an image number. They would always be searched
by the same key.

In the file system there would be a two or three level directory
structure and the directory names would be hexadecimal numbers like
'/images_directory/f/0/d' taken from the 2/3 first letters of the file
name, which would be a hash.

Whenever I want to know the hash name of an image so I can find it in
the file system I would have to search it first in the database by the
index. So I suppose the database will always be faster.

But there is an advantage in using the file system. It is that the
Apache's HTTP file caching management is there for free. If I store
the images in the database I will have to handle the request and
response HTTP headers myself if I want the images to be cached by the
clients.

What is the best practice in this situation? I mean not only the two
options as above but any.

Regards,
--
Clodoaldo Pinto Neto

Re: Database versus filesystem for storing images

From
Guy Rouillier
Date:
Clodoaldo wrote:
> A web application will store hundreds of thousands images.

<snip>

> What is the best practice in this situation? I mean not only the two
> options as above but any.

This discussion comes up regularly.  See the archives for a thread
titled "Storing images in PostgreSQL databases (again)" for the latest.

--
Guy Rouillier

Re: Database versus filesystem for storing images

From
Scott Ribe
Date:
Personally, I'd put them on the file system, because then backup software
can perform incremental backups. In the database, that becomes more of a
difficulty. One suggestion, don't use a file name from a hash to store the
image, just use the serial id, and break them up by hundreds or thousands,
iow image 1123 might be in images/000/000001/000001123.

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



Re: Database versus filesystem for storing images

From
"imageguy"
Date:
Scott Ribe wrote:
> Personally, I'd put them on the file system, because then backup software
> can perform incremental backups. In the database, that becomes more of a
> difficulty. One suggestion, don't use a file name from a hash to store the
> image, just use the serial id, and break them up by hundreds or thousands,
> iow image 1123 might be in images/000/000001/000001123.
>
> --
> Scott Ribe
> scott_ribe@killerbytes.com
> http://www.killerbytes.com/
> (303) 722-0567 voice

I think I know the answer, but if you don't have an "application
server" - ie a webserver, etc, and many of the workstations/clients
that need access to the images but may not have access to a network
share, isn't the database the only choice ?

 - or is there a postgresql function/utility that will "server" the
file from the file system based on the reference/link embeded in the
database ??

Geoff.


Re: Database versus filesystem for storing images

From
Clodoaldo
Date:
5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>:
>
> I think I know the answer,

If you know the answer please tell it as I have read some discussions
on the web and although I have decided on a solution I'm still not
sure about the best answer, if there is a best answer after all.

> but if you don't have an "application
> server" - ie a webserver, etc,

Yes I have an application server, the Apache server.

> and many of the workstations/clients
> that need access to the images but may not have access to a network
> share,

network share? I don't understand. The images will be loaded by html
pages with the img tag like in <img
src="http://domain.com/images/xxx.jpg">

> isn't the database the only choice ?

No. It is one of the choices. The other is to store the images in the
file system, in a directory readable by Apache.

>  - or is there a postgresql function/utility that will "server" the
> file from the file system based on the reference/link embeded in the
> database ??

I think some procedure languages can read files. In this case what
would be the gain in introducing a middle man, the db server?

Regards,
--
Clodoaldo Pinto Neto

Re: Database versus filesystem for storing images

From
"Jeremy Haile"
Date:
It's almost always better to store the images on the file system and
just store the filename or relative path in the database.

This is more efficient, doesn't bloat the database by storing files in
it, and is easier to get proper browser caching behavior (depending on
how your app is setup).   I try to avoid BLOBs whenever possible.

Cheers,
Jeremy Haile


On Fri, 5 Jan 2007 17:18:10 -0200, "Clodoaldo"
<clodoaldo.pinto.neto@gmail.com> said:
> 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>:
> >
> > I think I know the answer,
>
> If you know the answer please tell it as I have read some discussions
> on the web and although I have decided on a solution I'm still not
> sure about the best answer, if there is a best answer after all.
>
> > but if you don't have an "application
> > server" - ie a webserver, etc,
>
> Yes I have an application server, the Apache server.
>
> > and many of the workstations/clients
> > that need access to the images but may not have access to a network
> > share,
>
> network share? I don't understand. The images will be loaded by html
> pages with the img tag like in <img
> src="http://domain.com/images/xxx.jpg">
>
> > isn't the database the only choice ?
>
> No. It is one of the choices. The other is to store the images in the
> file system, in a directory readable by Apache.
>
> >  - or is there a postgresql function/utility that will "server" the
> > file from the file system based on the reference/link embeded in the
> > database ??
>
> I think some procedure languages can read files. In this case what
> would be the gain in introducing a middle man, the db server?
>
> Regards,
> --
> Clodoaldo Pinto Neto
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly

Re: Database versus filesystem for storing images

From
John McCawley
Date:
Don't store your images in the database.  Store them on the filesystem
and store their path in the database.  Anyone that tells you otherwise
is a stark raving madman :)

My system is very heavily used, and our pg_dump is only a few gigs.
Meanwhile our images/documents storage is well over a hundred gigs.  I'd
hate to think that I'd have to dump and restore 100 gigs every time I
wanted to dump the newest data to the development database.


As far as how they actually get to the client machine, typically these
days people use web servers for this sort of thing.

Clodoaldo wrote:

> 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>:
>
>>
>> I think I know the answer,
>
>
> If you know the answer please tell it as I have read some discussions
> on the web and although I have decided on a solution I'm still not
> sure about the best answer, if there is a best answer after all.
>
>> but if you don't have an "application
>> server" - ie a webserver, etc,
>
>
> Yes I have an application server, the Apache server.
>
>> and many of the workstations/clients
>> that need access to the images but may not have access to a network
>> share,
>
>
> network share? I don't understand. The images will be loaded by html
> pages with the img tag like in <img
> src="http://domain.com/images/xxx.jpg">
>
>> isn't the database the only choice ?
>
>
> No. It is one of the choices. The other is to store the images in the
> file system, in a directory readable by Apache.
>
>>  - or is there a postgresql function/utility that will "server" the
>> file from the file system based on the reference/link embeded in the
>> database ??
>
>
> I think some procedure languages can read files. In this case what
> would be the gain in introducing a middle man, the db server?
>
> Regards,


Re: Database versus filesystem for storing images

From
Jorge Godoy
Date:
John McCawley <nospam@hardgeus.com> writes:

> Don't store your images in the database.  Store them on the filesystem and
> store their path in the database.  Anyone that tells you otherwise is a stark
> raving madman :)
>
> My system is very heavily used, and our pg_dump is only a few gigs.  Meanwhile
> our images/documents storage is well over a hundred gigs.  I'd hate to think
> that I'd have to dump and restore 100 gigs every time I wanted to dump the
> newest data to the development database.

How do you plan your backup routine and how do you guarantee that on a failure
all needed data is restored?  I mean, how do you handle integrity with data
outside the database?

--
Jorge Godoy      <jgodoy@gmail.com>

Re: Database versus filesystem for storing images

From
Andrew Chernow
Date:
 >> Don't store your images in the database.  Store them on the filesystem and
 >> store their path in the database
I 100% agree.  Use the database as a lookup into the filesystem.  Don't load the
database up with terabytes of non-searchable binary data?  not sure how that
would help you?

Here is one idea:

Have a 64-bit sequence that you use to generate an image_id (becomes file name).
  Hash that id out over a 3-level deep directory structure that allows 4096
entries per directory.  Should give you 64-bit worth of files.

CREATE OR REPLACE FUNCTION get_image_path(image_id BIGINT)
RETURNS TEXT AS $$
   DECLARE
     -- 40963, avoids "integer out of range"
     dir1_val BIGINT := 68719476736;
     dir1 BIGINT;
     dir2 BIGINT;
     dir3 BIGINT;
     path TEXT;

   BEGIN
     dir1 := (image_id / dir1_val) % 4096;
     dir2 := (image_id / (4096 * 4096)) % 4096;
     dir3 := (image_id / 4096) % 4096;
     RETURN '/BASE_PATH/' || dir1 || '/' || dir2 ||
            '/' || dir3 || '/' || image_id;
   END;
$$ LANGUAGE PLPGSQL;

test=# select get_image_path(200399322222);
            get_image_path
-------------------------------------
  /BASE_PATH/2/3752/2991/200399322222
(1 row)

 >I mean, how do you handle integrity with data
 > outside the database?
You don't, the file system handles integrity of the stored data.  Although, one
must careful to avoid db and fs orphans.  Meaning, a record with no
corresponding file or a file with no corresponging record.  Always
write()/insert an image file to the system within a transaction, including
writing the image out to the fs.  Make sure to unlink any paritally written
image files.

 >>How do you plan your backup routine
In regards to backup, backup the files one-by-one.  Grab the lastest image file
refs from the database and start backing up those images.  Each successfully
backed up image should be followed by inserting that file's database record into
a remote db server.  If anything fails, cleanup the partial image file (to avoid
orphaned data) and rollout the transaction.

just one idea.  i'm sure there are other ways of doing it.  point is, this is
completely possible to do reliably.

andrew



Jorge Godoy wrote:
> John McCawley <nospam@hardgeus.com> writes:
>
>> Don't store your images in the database.  Store them on the filesystem and
>> store their path in the database.  Anyone that tells you otherwise is a stark
>> raving madman :)
>>
>> My system is very heavily used, and our pg_dump is only a few gigs.  Meanwhile
>> our images/documents storage is well over a hundred gigs.  I'd hate to think
>> that I'd have to dump and restore 100 gigs every time I wanted to dump the
>> newest data to the development database.
>
> How do you plan your backup routine and how do you guarantee that on a failure
> all needed data is restored?  I mean, how do you handle integrity with data
> outside the database?
>

Re: Database versus filesystem for storing images

From
James Neff
Date:
"... and Moses said unto them, 'The eleventh commandment :  thou shalt
store images in a database!'..."

What if you had another database where you stored just the images and
not back it up if you don't want to?

As an application developer, I like the idea of storing files and images
in the database because it makes it much easier to control access and
security from an application standpoint.

I think Microsoft SQL Server stores blobs in a separate file, and only
retains pointers in the actually database field for that blob.  So when
you SELECT on that blob MS SQL reads the external file for you as if it
lived in the database.  I don't know if Postgres does the same thing,
but if it did, you wouldn't have to worry about "bloating" database files.

Sounds like this is for an Apache web application.  Think about how web
sites like Flickr and Webshots store their images in a database.  You
could write a cool Apache mod so that the url:
"http://mycompany.com/images/01234.jpg"  would go through this module,
pull the appropriate image from the database and send it back; all the
while the client is none-the-wiser.  Just a thought.

I think its one of those things where there's not right or wrong
answer.  Instead you just have to do the minimum of what your
application requires.  If you don't need application-level control over
the files, then by all means store them on the file system.  But if you
need to control security than you have to prevent physical access to the
file (which means no file system storage) and pull the image from the
database through the application.

My two cents,
James



John McCawley wrote:
> Don't store your images in the database.  Store them on the filesystem
> and store their path in the database.  Anyone that tells you otherwise
> is a stark raving madman :)
>
> My system is very heavily used, and our pg_dump is only a few gigs.
> Meanwhile our images/documents storage is well over a hundred gigs.
> I'd hate to think that I'd have to dump and restore 100 gigs every
> time I wanted to dump the newest data to the development database.
>
>
> As far as how they actually get to the client machine, typically these
> days people use web servers for this sort of thing.
> Clodoaldo wrote:
>
>> 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>:
>>
>>>
>>> I think I know the answer,
>>
>>
>> If you know the answer please tell it as I have read some discussions
>> on the web and although I have decided on a solution I'm still not
>> sure about the best answer, if there is a best answer after all.
>>
>>> but if you don't have an "application
>>> server" - ie a webserver, etc,
>>
>>
>> Yes I have an application server, the Apache server.
>>
>>> and many of the workstations/clients
>>> that need access to the images but may not have access to a network
>>> share,
>>
>>
>> network share? I don't understand. The images will be loaded by html
>> pages with the img tag like in <img
>> src="http://domain.com/images/xxx.jpg">
>>
>>> isn't the database the only choice ?
>>
>>
>> No. It is one of the choices. The other is to store the images in the
>> file system, in a directory readable by Apache.
>>
>>>  - or is there a postgresql function/utility that will "server" the
>>> file from the file system based on the reference/link embeded in the
>>> database ??
>>
>>
>> I think some procedure languages can read files. In this case what
>> would be the gain in introducing a middle man, the db server?
>>
>> Regards,
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>               http://archives.postgresql.org/
>


Re: Database versus filesystem for storing images

From
John McCawley
Date:
This is a web app, so in my example all of the images live on a web
server, and our data lives on a separate database server.  We have a
completely duplicated setup offsite, and mirror images of every server
at the backup site.  Every night we use rsync to duplicate everything
offsite.  Also, a cron job pg_dumps every night and copies the dump over
to the backup DB server.

And before anybody gives me any guff, our office is in New Orleans, and
we went through Katrina with less than an hour of downtime, and without
losing anything.  So there ;)



Jorge Godoy wrote:

>John McCawley <nospam@hardgeus.com> writes:
>
>
>
>>Don't store your images in the database.  Store them on the filesystem and
>>store their path in the database.  Anyone that tells you otherwise is a stark
>>raving madman :)
>>
>>My system is very heavily used, and our pg_dump is only a few gigs.  Meanwhile
>>our images/documents storage is well over a hundred gigs.  I'd hate to think
>>that I'd have to dump and restore 100 gigs every time I wanted to dump the
>>newest data to the development database.
>>
>>
>
>How do you plan your backup routine and how do you guarantee that on a failure
>all needed data is restored?  I mean, how do you handle integrity with data
>outside the database?
>
>
>

Re: Database versus filesystem for storing images

From
"Jeanna Geier"
Date:
We use WebDAV and Apache's Slide to store our images and, as someone pointed out earlier, store the links to the images
inour database. 

WebDAV has provided us with excellent access control and security...
http://www.webdav.org/
http://jakarta.apache.org/slide/index.html

Just my 1/2 cents,
-Jeanna

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org]On Behalf Of James Neff
Sent: Friday, January 05, 2007 2:27 PM
To: John McCawley
Cc: Clodoaldo; imageguy; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database versus filesystem for storing images


"... and Moses said unto them, 'The eleventh commandment :  thou shalt
store images in a database!'..."

What if you had another database where you stored just the images and
not back it up if you don't want to?

As an application developer, I like the idea of storing files and images
in the database because it makes it much easier to control access and
security from an application standpoint.

I think Microsoft SQL Server stores blobs in a separate file, and only
retains pointers in the actually database field for that blob.  So when
you SELECT on that blob MS SQL reads the external file for you as if it
lived in the database.  I don't know if Postgres does the same thing,
but if it did, you wouldn't have to worry about "bloating" database files.

Sounds like this is for an Apache web application.  Think about how web
sites like Flickr and Webshots store their images in a database.  You
could write a cool Apache mod so that the url:
"http://mycompany.com/images/01234.jpg"  would go through this module,
pull the appropriate image from the database and send it back; all the
while the client is none-the-wiser.  Just a thought.

I think its one of those things where there's not right or wrong
answer.  Instead you just have to do the minimum of what your
application requires.  If you don't need application-level control over
the files, then by all means store them on the file system.  But if you
need to control security than you have to prevent physical access to the
file (which means no file system storage) and pull the image from the
database through the application.

My two cents,
James



John McCawley wrote:
> Don't store your images in the database.  Store them on the filesystem
> and store their path in the database.  Anyone that tells you otherwise
> is a stark raving madman :)
>
> My system is very heavily used, and our pg_dump is only a few gigs.
> Meanwhile our images/documents storage is well over a hundred gigs.
> I'd hate to think that I'd have to dump and restore 100 gigs every
> time I wanted to dump the newest data to the development database.
>
>
> As far as how they actually get to the client machine, typically these
> days people use web servers for this sort of thing.
> Clodoaldo wrote:
>
>> 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>:
>>
>>>
>>> I think I know the answer,
>>
>>
>> If you know the answer please tell it as I have read some discussions
>> on the web and although I have decided on a solution I'm still not
>> sure about the best answer, if there is a best answer after all.
>>
>>> but if you don't have an "application
>>> server" - ie a webserver, etc,
>>
>>
>> Yes I have an application server, the Apache server.
>>
>>> and many of the workstations/clients
>>> that need access to the images but may not have access to a network
>>> share,
>>
>>
>> network share? I don't understand. The images will be loaded by html
>> pages with the img tag like in <img
>> src="http://domain.com/images/xxx.jpg">
>>
>>> isn't the database the only choice ?
>>
>>
>> No. It is one of the choices. The other is to store the images in the
>> file system, in a directory readable by Apache.
>>
>>>  - or is there a postgresql function/utility that will "server" the
>>> file from the file system based on the reference/link embeded in the
>>> database ??
>>
>>
>> I think some procedure languages can read files. In this case what
>> would be the gain in introducing a middle man, the db server?
>>
>> Regards,
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>               http://archives.postgresql.org/
>


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq


Re: Database versus filesystem for storing images

From
John McCawley
Date:
My comment about stark raving madmen was somewhat tongue-in-cheek.
There is certainly a case to be made for images in a database under some
circumstances.  However, for the average Joe web developer, and
certainly someone new to the whole problem, I think storing them on the
filesystem is going to be by far the most painless and efficient approach.

James Neff wrote:

> "... and Moses said unto them, 'The eleventh commandment :  thou shalt
> store images in a database!'..."
>
> What if you had another database where you stored just the images and
> not back it up if you don't want to?
>
> As an application developer, I like the idea of storing files and
> images in the database because it makes it much easier to control
> access and security from an application standpoint.
>
> I think Microsoft SQL Server stores blobs in a separate file, and only
> retains pointers in the actually database field for that blob.  So
> when you SELECT on that blob MS SQL reads the external file for you as
> if it lived in the database.  I don't know if Postgres does the same
> thing, but if it did, you wouldn't have to worry about "bloating"
> database files.
>
> Sounds like this is for an Apache web application.  Think about how
> web sites like Flickr and Webshots store their images in a database.
> You could write a cool Apache mod so that the url:
> "http://mycompany.com/images/01234.jpg"  would go through this module,
> pull the appropriate image from the database and send it back; all the
> while the client is none-the-wiser.  Just a thought.
>
> I think its one of those things where there's not right or wrong
> answer.  Instead you just have to do the minimum of what your
> application requires.  If you don't need application-level control
> over the files, then by all means store them on the file system.  But
> if you need to control security than you have to prevent physical
> access to the file (which means no file system storage) and pull the
> image from the database through the application.
>
> My two cents,
> James


Re: Database versus filesystem for storing images

From
"Jeremy Haile"
Date:
How does it make it easier to control access and security?  If your web
app makes a decision about allowing access to the database, it can just
as easily make a decision about allowing access to the filesystem.
Storing the images on the file system doesn't mean that there isn't a
piece of code that determines whether or not users can access a
particular file.

I see security and access as a non-issue in making this decision.
Either way, it's got to be controlled at the application level (if it is
even necessary).



On Fri, 05 Jan 2007 15:26:45 -0500, "James Neff"
<jneff@tethyshealth.com> said:
> "... and Moses said unto them, 'The eleventh commandment :  thou shalt
> store images in a database!'..."
>
> What if you had another database where you stored just the images and
> not back it up if you don't want to?
>
> As an application developer, I like the idea of storing files and images
> in the database because it makes it much easier to control access and
> security from an application standpoint.
>
> I think Microsoft SQL Server stores blobs in a separate file, and only
> retains pointers in the actually database field for that blob.  So when
> you SELECT on that blob MS SQL reads the external file for you as if it
> lived in the database.  I don't know if Postgres does the same thing,
> but if it did, you wouldn't have to worry about "bloating" database
> files.
>
> Sounds like this is for an Apache web application.  Think about how web
> sites like Flickr and Webshots store their images in a database.  You
> could write a cool Apache mod so that the url:
> "http://mycompany.com/images/01234.jpg"  would go through this module,
> pull the appropriate image from the database and send it back; all the
> while the client is none-the-wiser.  Just a thought.
>
> I think its one of those things where there's not right or wrong
> answer.  Instead you just have to do the minimum of what your
> application requires.  If you don't need application-level control over
> the files, then by all means store them on the file system.  But if you
> need to control security than you have to prevent physical access to the
> file (which means no file system storage) and pull the image from the
> database through the application.
>
> My two cents,
> James
>
>
>
> John McCawley wrote:
> > Don't store your images in the database.  Store them on the filesystem
> > and store their path in the database.  Anyone that tells you otherwise
> > is a stark raving madman :)
> >
> > My system is very heavily used, and our pg_dump is only a few gigs.
> > Meanwhile our images/documents storage is well over a hundred gigs.
> > I'd hate to think that I'd have to dump and restore 100 gigs every
> > time I wanted to dump the newest data to the development database.
> >
> >
> > As far as how they actually get to the client machine, typically these
> > days people use web servers for this sort of thing.
> > Clodoaldo wrote:
> >
> >> 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>:
> >>
> >>>
> >>> I think I know the answer,
> >>
> >>
> >> If you know the answer please tell it as I have read some discussions
> >> on the web and although I have decided on a solution I'm still not
> >> sure about the best answer, if there is a best answer after all.
> >>
> >>> but if you don't have an "application
> >>> server" - ie a webserver, etc,
> >>
> >>
> >> Yes I have an application server, the Apache server.
> >>
> >>> and many of the workstations/clients
> >>> that need access to the images but may not have access to a network
> >>> share,
> >>
> >>
> >> network share? I don't understand. The images will be loaded by html
> >> pages with the img tag like in <img
> >> src="http://domain.com/images/xxx.jpg">
> >>
> >>> isn't the database the only choice ?
> >>
> >>
> >> No. It is one of the choices. The other is to store the images in the
> >> file system, in a directory readable by Apache.
> >>
> >>>  - or is there a postgresql function/utility that will "server" the
> >>> file from the file system based on the reference/link embeded in the
> >>> database ??
> >>
> >>
> >> I think some procedure languages can read files. In this case what
> >> would be the gain in introducing a middle man, the db server?
> >>
> >> Regards,
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 4: Have you searched our list archives?
> >
> >               http://archives.postgresql.org/
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

Re: Database versus filesystem for storing images

From
Jorge Godoy
Date:
John McCawley <nospam@hardgeus.com> writes:

> This is a web app, so in my example all of the images live on a web server,
> and our data lives on a separate database server.  We have a completely
> duplicated setup offsite, and mirror images of every server at the backup
> site.  Every night we use rsync to duplicate everything offsite.  Also, a cron
> job pg_dumps every night and copies the dump over to the backup DB server.
>
> And before anybody gives me any guff, our office is in New Orleans, and we
> went through Katrina with less than an hour of downtime, and without losing
> anything.  So there ;)

Anyway, you have no guarantee that all your images exist on file and that all
existing files have a corresponding entry in your database.

--
Jorge Godoy      <jgodoy@gmail.com>

Re: Database versus filesystem for storing images

From
Jorge Godoy
Date:
Andrew Chernow <pg-job@esilo.com> writes:

>>I mean, how do you handle integrity with data
>> outside the database?
> You don't, the file system handles integrity of the stored data.  Although,
> one must careful to avoid db and fs orphans.  Meaning, a record with no
> corresponding file or a file with no corresponging record.  Always
> write()/insert an image file to the system within a transaction, including
> writing the image out to the fs.  Make sure to unlink any paritally written
> image files.

And how do you guarantee that after a failure?  You're restoring two
different sets of data here:

          - backup from your database
          - backup from your files

How do you link them together on that specific operation?  Or even on a daily
basis, if you get corrupted data...

>>>How do you plan your backup routine
> In regards to backup, backup the files one-by-one.  Grab the lastest image
> file refs from the database and start backing up those images.  Each
> successfully backed up image should be followed by inserting that file's
> database record into a remote db server.  If anything fails, cleanup the
> partial image file (to avoid orphaned data) and rollout the transaction.
>
> just one idea.  i'm sure there are other ways of doing it.  point is, this is
> completely possible to do reliably.

Wouldn't replication with, e.g., Slony be easier?  And wouldn't letting the
database handle all the integrity be easier?  I mean, create an "images" table
and then make your record depends on this table, so if there's no record with
the image, you won't have any references to it left.

It would also make the backup plan easier: backup the database.

Not counting that depending on your choice of filesystem and image size you
might get a very poor performance.



--
Jorge Godoy      <jgodoy@gmail.com>

Re: Database versus filesystem for storing images

From
Andrew Chernow
Date:
 > And how do you guarantee that after a failure?  You're restoring two
 > different sets of data here:

 > How do you link them together on that specific operation?  Or even on a daily
 > basis, if you get corrupted data...

I answered that already.

 > Not counting that depending on your choice of filesystem and image size you
 > might get a very poor performance.
apache has very good page and image caching.  You could take advantage of that
using this technique.

Another nice feature is the database and images can be handled spearately.  Some
people have seen this as a disadvantage on this thread, I personally don't see
it that why.

I guess it depends on access needs, many files and how much data you have.  What
if you had 3 billion files across a few hundred terabytes?  Can you say with
experience how the database would hold up in this situation?

andrew



Jorge Godoy wrote:
> Andrew Chernow <pg-job@esilo.com> writes:
>
>>> I mean, how do you handle integrity with data
>>> outside the database?
>> You don't, the file system handles integrity of the stored data.  Although,
>> one must careful to avoid db and fs orphans.  Meaning, a record with no
>> corresponding file or a file with no corresponging record.  Always
>> write()/insert an image file to the system within a transaction, including
>> writing the image out to the fs.  Make sure to unlink any paritally written
>> image files.
>
> And how do you guarantee that after a failure?  You're restoring two
> different sets of data here:
>
>           - backup from your database
>           - backup from your files
>
> How do you link them together on that specific operation?  Or even on a daily
> basis, if you get corrupted data...
>
>>>> How do you plan your backup routine
>> In regards to backup, backup the files one-by-one.  Grab the lastest image
>> file refs from the database and start backing up those images.  Each
>> successfully backed up image should be followed by inserting that file's
>> database record into a remote db server.  If anything fails, cleanup the
>> partial image file (to avoid orphaned data) and rollout the transaction.
>>
>> just one idea.  i'm sure there are other ways of doing it.  point is, this is
>> completely possible to do reliably.
>
> Wouldn't replication with, e.g., Slony be easier?  And wouldn't letting the
> database handle all the integrity be easier?  I mean, create an "images" table
> and then make your record depends on this table, so if there's no record with
> the image, you won't have any references to it left.
>
> It would also make the backup plan easier: backup the database.
>
> Not counting that depending on your choice of filesystem and image size you
> might get a very poor performance.
>
>
>

Re: Database versus filesystem for storing images

From
Bruno Wolff III
Date:
On Fri, Jan 05, 2007 at 15:26:45 -0500,
  James Neff <jneff@tethyshealth.com> wrote:
> "... and Moses said unto them, 'The eleventh commandment :  thou shalt
> store images in a database!'..."
>
> What if you had another database where you stored just the images and
> not back it up if you don't want to?

I think the main reason to keep images in the database is if you need
transactional semantics. If you are updating images and transactions that
started before the update, need to see the old version you are going to
want them in the database. I suspect this need isn't very common though.

Re: Database versus filesystem for storing images

From
"Jeremy Haile"
Date:
Yeah - it can make it easier to implement transactional semantics by
storing them in the database, although for simple operations it wouldn't
be hard to replicate this manually.  And you are going to incur a
performance penalty by storing them in the database.

Another thing to consider is that storing them in the file system makes
it much easier to browse the images using third-party tools, update
them, archive them (by gzipping or whatever).  This is much more
difficult if they are stored in the database.


On Fri, 5 Jan 2007 15:51:59 -0600, "Bruno Wolff III" <bruno@wolff.to>
said:
> On Fri, Jan 05, 2007 at 15:26:45 -0500,
>   James Neff <jneff@tethyshealth.com> wrote:
> > "... and Moses said unto them, 'The eleventh commandment :  thou shalt
> > store images in a database!'..."
> >
> > What if you had another database where you stored just the images and
> > not back it up if you don't want to?
>
> I think the main reason to keep images in the database is if you need
> transactional semantics. If you are updating images and transactions that
> started before the update, need to see the old version you are going to
> want them in the database. I suspect this need isn't very common though.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

Re: Database versus filesystem for storing images

From
Scott Marlowe
Date:
On Fri, 2007-01-05 at 15:54, Jeremy Haile wrote:
> Yeah - it can make it easier to implement transactional semantics by
> storing them in the database, although for simple operations it wouldn't
> be hard to replicate this manually.  And you are going to incur a
> performance penalty by storing them in the database.
>
> Another thing to consider is that storing them in the file system makes
> it much easier to browse the images using third-party tools, update
> them, archive them (by gzipping or whatever).  This is much more
> difficult if they are stored in the database.

The biggest performance penalty from them being in the same database as
your data is that they're going to be the majority of your kernel /
database buffers.  So, as mentioned earlier, it's almost a necessity
(for performance reasons) to put them into their own db server.

Re: Database versus filesystem for storing images

From
Jorge Godoy
Date:
"Jeremy Haile" <jhaile@fastmail.fm> writes:

> Another thing to consider is that storing them in the file system makes
> it much easier to browse the images using third-party tools, update
> them, archive them (by gzipping or whatever).  This is much more
> difficult if they are stored in the database.

If you touch the files with third-party tools how are you going to prevent
that they aren't missing when the database say they are there?  If you're
referencing them somehow, you have to guarantee that they are there...  Or am
I the only one that is thinking about referential integrity with those files?

--
Jorge Godoy      <jgodoy@gmail.com>

Re: Database versus filesystem for storing images

From
John McCawley
Date:
If you end up storing the pictures on a separate database server for
performance reasons, Jorge's argument regarding integrity becomes
moot...And his argument so far is the strongest case I've seen for
storing the files in the database.

Scott Marlowe wrote:

>On Fri, 2007-01-05 at 15:54, Jeremy Haile wrote:
>
>
>>Yeah - it can make it easier to implement transactional semantics by
>>storing them in the database, although for simple operations it wouldn't
>>be hard to replicate this manually.  And you are going to incur a
>>performance penalty by storing them in the database.
>>
>>Another thing to consider is that storing them in the file system makes
>>it much easier to browse the images using third-party tools, update
>>them, archive them (by gzipping or whatever).  This is much more
>>difficult if they are stored in the database.
>>
>>
>
>The biggest performance penalty from them being in the same database as
>your data is that they're going to be the majority of your kernel /
>database buffers.  So, as mentioned earlier, it's almost a necessity
>(for performance reasons) to put them into their own db server.
>
>---------------------------(end of broadcast)---------------------------
>TIP 5: don't forget to increase your free space map settings
>
>

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Re: Database versus filesystem for storing images

From
Andrew Chernow
Date:
 >> Or am
 >>I the only one that is thinking about referential integrity with those files?
Not at all.  I'm not sure how 3rd party tools like apache, `ls`, `gzip`, `find`,
nfs, etc... are breaking integrity.  Any php, jsp, C or shell script you write
would be doing the same thing, accessing the data.  All your doing is making
your system more accessible to a wider range of tools, other than your own.

Just like you are cautious about not deleting the pg_data folder, big no-no, you
need to be cautious about not deleting or modifying these image files.
Basically, the image files are an extension of the database that you would glue
together.  I think there is a clear separation of tasks here.  I think this is
required if you were handling any sizeable amounts of data.

The other thing is the original poster needs apache to access these images.
This is a requirement of his/her project.  Probably a good idea to meet those
requirements.  It is far more effecient to have apache access them directly then
pounding your database with web requests for image file data.  It is good
design, and distribution of tasks, to get the image paths from the database and
and have apache server the data; select images paths from php or something.  Now
you can have the data anywhere, on a different server, over an nfs mount, gfs,
wherever.  Much more flexible and distributed.

Andrew


Jorge Godoy wrote:
> "Jeremy Haile" <jhaile@fastmail.fm> writes:
>
>> Another thing to consider is that storing them in the file system makes
>> it much easier to browse the images using third-party tools, update
>> them, archive them (by gzipping or whatever).  This is much more
>> difficult if they are stored in the database.
>
> If you touch the files with third-party tools how are you going to prevent
> that they aren't missing when the database say they are there?  If you're
> referencing them somehow, you have to guarantee that they are there...  Or am
> I the only one that is thinking about referential integrity with those files?
>

Re: Database versus filesystem for storing images

From
Jorge Godoy
Date:
Andrew Chernow <pg-job@esilo.com> writes:

>> And how do you guarantee that after a failure?  You're restoring two
>> different sets of data here:
>
>> How do you link them together on that specific operation?  Or even on a daily
>> basis, if you get corrupted data...
>
> I answered that already.

I'm sorry.  It must be the flu, the pain or something else, but I really don't
remember reading your message about how you can be 100% sure that all
references to the filesystem have their corresponding files present and also
all present files have their respective database entry.

I've seen HA measures (I don't imagine anyone sacrificing their customers
copying 3 billion files and a few hundred terabytes while still maintaining an
adequate service rate with part of its infra-structure down, just to use your
example to that answer...), ideas about requiring an answer from the
filesystem before considering the transaction done DB-wise (who grants you
that the image really went to the disk and is not on cache when the machine
has a power failure and shuts down abruptly?)...

I might have missed your message, though.  Would you be gentle to quote that
again, please?

> Another nice feature is the database and images can be handled spearately.

What might be bad.

> Some people have seen this as a disadvantage on this thread, I personally
> don't see it that why.

I am questioning two points that show two situations where it is bad.
Specially if those images are important to the records (e.g. product failure
images, prize winning images, product specs, prototype images, blueprints --
after all, we don't need to restrict our files to images, right? --,
agreements, spreadsheets with the last years of company account movements,
documents received from lawyers, etc.).

> I guess it depends on access needs, many files and how much data you have.
> What if you had 3 billion files across a few hundred terabytes?  Can you say
> with experience how the database would hold up in this situation?

I'd have partitioning if I had a case like that.  Part of those would be
delegated to one machine, part to another and so on.  Even if that solution --
partitioning -- makes the overall MTBF lower...

And I still can't imagine how you guarantee that all 3 billion files have
their corresponding entries on the database.  Couting them is not enough since
I can have one file with the wrong "name" present on the filesystem or some
duplicate record on the DB...


--
Jorge Godoy      <jgodoy@gmail.com>

Re: Database versus filesystem for storing images

From
"Jeremy Haile"
Date:
Referential integrity would be one positive for storing the files in the
database.  I wasn't responding to that.  I'm simply saying that browsing
them with third-party tools, updating, compressing/archiving, etc. is
easier if they are not in the database.  Those are all actions that I've
found useful on other projects when storing user-uploaded images.

Depending upon the number of images on disk, it might not be hard to
verify that all the images referenced from the DB are there.  You could
have a small program that steps through each record and verifies its
existence on disk.  That could easily be run after a restore or as a
batch-process.  If you're talking about trillions of images - sure
that'd be tough.  If these images are extremely important - maybe that
solution is unacceptable.  But this is just a case of too little
information to make a decision.

There are pros and cons to both approach - in every project I've worked
on that faced this decision, I felt the pros of storing it in the file
system outweighed the pros of storing it in the DB.  But there is no
right or wrong answer to the question (unless of course you had special
circumstances that made one option clearly superior - but I don't think
we know enough details to make that call)

My two cents,
Jeremy Haile

On Fri, 05 Jan 2007 20:24:05 -0200, "Jorge Godoy" <jgodoy@gmail.com>
said:
> "Jeremy Haile" <jhaile@fastmail.fm> writes:
>
> > Another thing to consider is that storing them in the file system makes
> > it much easier to browse the images using third-party tools, update
> > them, archive them (by gzipping or whatever).  This is much more
> > difficult if they are stored in the database.
>
> If you touch the files with third-party tools how are you going to
> prevent
> that they aren't missing when the database say they are there?  If you're
> referencing them somehow, you have to guarantee that they are there...
> Or am
> I the only one that is thinking about referential integrity with those
> files?
>
> --
> Jorge Godoy      <jgodoy@gmail.com>

Re: Database versus filesystem for storing images

From
"imageguy"
Date:
Scott Ribe wrote:
> Personally, I'd put them on the file system, because then backup software
> can perform incremental backups. In the database, that becomes more of a
> difficulty. One suggestion, don't use a file name from a hash to store the
> image, just use the serial id, and break them up by hundreds or thousands,
> iow image 1123 might be in images/000/000001/000001123.
>
> --
> Scott Ribe
> scott_ribe@killerbytes.com
> http://www.killerbytes.com/
> (303) 722-0567 voice

I think I know the answer, but if you don't have an "application
server" - ie a webserver, etc, and many of the workstations/clients
that need access to the images but may not have access to a network
share, isn't the database the only choice ?

 - or is there a postgresql function/utility that will "server" the
file from the file system based on the reference/link embeded in the
database ??

Geoff.


Re: Database versus filesystem for storing images

From
Ragnar
Date:
On fös, 2007-01-05 at 15:49 -0500, Andrew Chernow wrote:
> I 100% agree.  Use the database as a lookup into the filesystem.  Don't load the
> database up with terabytes of non-searchable binary data?  not sure how that
> would help you?
>

>  >I mean, how do you handle integrity with data
>  > outside the database?
> You don't, the file system handles integrity of the stored data.  Although, one
> must careful to avoid db and fs orphans.  Meaning, a record with no
> corresponding file or a file with no corresponging record.  Always
> write()/insert an image file to the system within a transaction, including
> writing the image out to the fs.  Make sure to unlink any paritally written
> image files.

what happens if you rollback a transaction that just updated
an image file?

for that matter, what happens if one transaction is using or
even reading an image while another is updating it?

gnari



Re: Database versus filesystem for storing images

From
Andrew Chernow
Date:
 > what happens if you rollback a transaction that just updated
 > an image file?
 >
 > for that matter, what happens if one transaction is using or
 > even reading an image while another is updating it?

One thing I mentioned was about a point in time backup, not updating the image.
  This would rollback the transaction on the backup server, not the live one.

I was also proposing to never update an image.  Just generate a new file, you
got 64-bits worth of ids.  If you need to rollback an update to image on the
live server, unlink() the file and rollback.   If all is good, unlink the old
version.

andrew



Ragnar wrote:
> On fös, 2007-01-05 at 15:49 -0500, Andrew Chernow wrote:
>> I 100% agree.  Use the database as a lookup into the filesystem.  Don't load the
>> database up with terabytes of non-searchable binary data?  not sure how that
>> would help you?
>>
>
>>  >I mean, how do you handle integrity with data
>>  > outside the database?
>> You don't, the file system handles integrity of the stored data.  Although, one
>> must careful to avoid db and fs orphans.  Meaning, a record with no
>> corresponding file or a file with no corresponging record.  Always
>> write()/insert an image file to the system within a transaction, including
>> writing the image out to the fs.  Make sure to unlink any paritally written
>> image files.
>
> what happens if you rollback a transaction that just updated
> an image file?
>
> for that matter, what happens if one transaction is using or
> even reading an image while another is updating it?
>
> gnari
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
>
>

Re: Database versus filesystem for storing images

From
Andrew Chernow
Date:
 > copying 3 billion files and a few hundred terabytes while still maintaining an
 > adequate service rate with part of its infra-structure down, just to use your

I wasn't saying to do this each time you run a backup, geez that would be
horrible.  Pickup from where you left off the last time you backed up
data/records.  How many images and how much data is being generated in a 60
second period?  I dought 3 billion files and hundreds of terabytes.  When you
know what your data generation is, you know what resources you need to replicate
this information to a backup server (local or remote).

How is this any different than db replication.  It would have to backup the same
amount of information?  You would require the same horse power and bandwidth.

andrew


Jorge Godoy wrote:
> Andrew Chernow <pg-job@esilo.com> writes:
>
>>> And how do you guarantee that after a failure?  You're restoring two
>>> different sets of data here:
>>> How do you link them together on that specific operation?  Or even on a daily
>>> basis, if you get corrupted data...
>> I answered that already.
>
> I'm sorry.  It must be the flu, the pain or something else, but I really don't
> remember reading your message about how you can be 100% sure that all
> references to the filesystem have their corresponding files present and also
> all present files have their respective database entry.
>
> I've seen HA measures (I don't imagine anyone sacrificing their customers
> copying 3 billion files and a few hundred terabytes while still maintaining an
> adequate service rate with part of its infra-structure down, just to use your
> example to that answer...), ideas about requiring an answer from the
> filesystem before considering the transaction done DB-wise (who grants you
> that the image really went to the disk and is not on cache when the machine
> has a power failure and shuts down abruptly?)...
>
> I might have missed your message, though.  Would you be gentle to quote that
> again, please?
>
>> Another nice feature is the database and images can be handled spearately.
>
> What might be bad.
>
>> Some people have seen this as a disadvantage on this thread, I personally
>> don't see it that why.
>
> I am questioning two points that show two situations where it is bad.
> Specially if those images are important to the records (e.g. product failure
> images, prize winning images, product specs, prototype images, blueprints --
> after all, we don't need to restrict our files to images, right? --,
> agreements, spreadsheets with the last years of company account movements,
> documents received from lawyers, etc.).
>
>> I guess it depends on access needs, many files and how much data you have.
>> What if you had 3 billion files across a few hundred terabytes?  Can you say
>> with experience how the database would hold up in this situation?
>
> I'd have partitioning if I had a case like that.  Part of those would be
> delegated to one machine, part to another and so on.  Even if that solution --
> partitioning -- makes the overall MTBF lower...
>
> And I still can't imagine how you guarantee that all 3 billion files have
> their corresponding entries on the database.  Couting them is not enough since
> I can have one file with the wrong "name" present on the filesystem or some
> duplicate record on the DB...
>
>

Re: Database versus filesystem for storing images

From
Jorge Godoy
Date:
Ragnar <gnari@hive.is> writes:

> for that matter, what happens if one transaction is using or
> even reading an image while another is updating it?

I believe that this also depends on how the file is updated.  Some
applications create a temporary file with the new (or changed) content and
then replace the old file with this one.

There are also cases where you can keep a file descriptor open to one file
that has no name (it existed once) and that was replaced.

It all depends on how / who's generating and how / who's consuming the
information.

--
Jorge Godoy      <jgodoy@gmail.com>

Re: Database versus filesystem for storing images

From
Jorge Godoy
Date:
Andrew Chernow <pg-job@esilo.com> writes:

>>> Or am
>>>I the only one that is thinking about referential integrity with those files?
> Not at all.  I'm not sure how 3rd party tools like apache, `ls`, `gzip`,
> `find`, nfs, etc... are breaking integrity.  Any php, jsp, C or shell script

For gzip, for example:

    - DB record contains "/some/dir/file.ext"
    - Filesystem contains "/some/dir/file.ext.gz"

NFS can also be guilty if it fails or the server goes down.  If I have a share
mounted as "/some/remote/dir" and I say that the file is at
"/some/remote/dir/file.ext" but the NFS server is down then it is the same as
if the file didn't exist at all since it can't be reached.

For both cases, if the file is inside the database and I am referencing it
then I know that it *is* there.  Referential integrity takes care of that for
me with no cost or any other action of mine.

> you write would be doing the same thing, accessing the data.  All your doing
> is making your system more accessible to a wider range of tools, other than
> your own.

And introducint more failing points.

> Just like you are cautious about not deleting the pg_data folder, big no-no,
> you need to be cautious about not deleting or modifying these image
> files. Basically, the image files are an extension of the database that you
> would glue together.  I think there is a clear separation of tasks here.  I
> think this is required if you were handling any sizeable amounts of data.

So you have added the possibility of manipulating (which is different from
reading or accessing) the files directly but you say "don't touch them!".

> The other thing is the original poster needs apache to access these
> images. This is a requirement of his/her project.  Probably a good idea to

And nothing prevents those files from being served from the database.

> meet those requirements.  It is far more effecient to have apache access
> them

Where weren't we meeting his/her requirements?  All the discussion is around
available means to do that.  One option is having the files on the database,
the other is on the filesystem.  From my understanding we're discussing the
benefits of each one.  Aren't we?

> directly then pounding your database with web requests for image file data.

It might be.  If you can be certain that the image *is* there when it tries to
access it.  Both examples above -- gzip + NFS -- show two ways of having
different things inside the DB and on the FS.

> It is good design, and distribution of tasks, to get the image paths from the
> database and and have apache server the data; select images paths from php or
> something.  Now you can have the data anywhere, on a different server, over an
> nfs mount, gfs, wherever.  Much more flexible and distributed.

And also more uncertain that the referred data is there at all.

--
Jorge Godoy      <jgodoy@gmail.com>


Re: Database versus filesystem for storing images

From
Andrew Chernow
Date:
 > And introducint more failing points.
depends on how you do it.  not everything has to go in a database to be
reliable.  Part of good engineering is realizing when to use a tool and when not
to.

I think a 10K view of the issue is in order.  The bigger picture is being
missed, or I am not communicating well.

I bet if I gave you a million dollars, you could implement what I proposed; in a
reliable, redundant fasion.  You could probably convince anyone of its merits,
making my case for me.  All you would have to do is entertain the idea ... I
mean you sound smart enough ... probably smarter than me :)

andrew



Jorge Godoy wrote:
> Andrew Chernow <pg-job@esilo.com> writes:
>
>>>> Or am
>>>> I the only one that is thinking about referential integrity with those files?
>> Not at all.  I'm not sure how 3rd party tools like apache, `ls`, `gzip`,
>> `find`, nfs, etc... are breaking integrity.  Any php, jsp, C or shell script
>
> For gzip, for example:
>
>     - DB record contains "/some/dir/file.ext"
>     - Filesystem contains "/some/dir/file.ext.gz"
>
> NFS can also be guilty if it fails or the server goes down.  If I have a share
> mounted as "/some/remote/dir" and I say that the file is at
> "/some/remote/dir/file.ext" but the NFS server is down then it is the same as
> if the file didn't exist at all since it can't be reached.
>
> For both cases, if the file is inside the database and I am referencing it
> then I know that it *is* there.  Referential integrity takes care of that for
> me with no cost or any other action of mine.
>
>> you write would be doing the same thing, accessing the data.  All your doing
>> is making your system more accessible to a wider range of tools, other than
>> your own.
>
> And introducint more failing points.
>
>> Just like you are cautious about not deleting the pg_data folder, big no-no,
>> you need to be cautious about not deleting or modifying these image
>> files. Basically, the image files are an extension of the database that you
>> would glue together.  I think there is a clear separation of tasks here.  I
>> think this is required if you were handling any sizeable amounts of data.
>
> So you have added the possibility of manipulating (which is different from
> reading or accessing) the files directly but you say "don't touch them!".
>
>> The other thing is the original poster needs apache to access these
>> images. This is a requirement of his/her project.  Probably a good idea to
>
> And nothing prevents those files from being served from the database.
>
>> meet those requirements.  It is far more effecient to have apache access
>> them
>
> Where weren't we meeting his/her requirements?  All the discussion is around
> available means to do that.  One option is having the files on the database,
> the other is on the filesystem.  From my understanding we're discussing the
> benefits of each one.  Aren't we?
>
>> directly then pounding your database with web requests for image file data.
>
> It might be.  If you can be certain that the image *is* there when it tries to
> access it.  Both examples above -- gzip + NFS -- show two ways of having
> different things inside the DB and on the FS.
>
>> It is good design, and distribution of tasks, to get the image paths from the
>> database and and have apache server the data; select images paths from php or
>> something.  Now you can have the data anywhere, on a different server, over an
>> nfs mount, gfs, wherever.  Much more flexible and distributed.
>
> And also more uncertain that the referred data is there at all.
>

Re: Database versus filesystem for storing images

From
Jorge Godoy
Date:
Andrew Chernow <pg-job@esilo.com> writes:

> I wasn't saying to do this each time you run a backup, geez that would be
> horrible.  Pickup from where you left off the last time you backed up
> data/records.  How many images and how much data is being generated in a 60
> second period?  I dought 3 billion files and hundreds of terabytes.  When you
> know what your data generation is, you know what resources you need to
> replicate this information to a backup server (local or remote).

I'm not talking about backups.  I'm talking about restores.

> How is this any different than db replication.  It would have to backup the
> same amount of information?  You would require the same horse power and
> bandwidth.

The difference is that I'd restore the data and then sync the difference from
some point in time to "now".  The referential integrity would be guaranteed by
the database itself and I won't have any pointers to files that doesn't exist
or files without pointers to it.

--
Jorge Godoy      <jgodoy@gmail.com>

Re: Database versus filesystem for storing images

From
Jorge Godoy
Date:
Andrew Chernow <pg-job@esilo.com> writes:

> depends on how you do it.  not everything has to go in a database to be
> reliable.  Part of good engineering is realizing when to use a tool and when
> not to.
>
> I think a 10K view of the issue is in order.  The bigger picture is being
> missed, or I am not communicating well.
>
> I bet if I gave you a million dollars, you could implement what I proposed;
> in a reliable, redundant fasion.  You could probably convince anyone of its
> merits, making my case for me.  All you would have to do is entertain the
> idea ... I mean you sound smart enough ... probably smarter than me :)

LOL.  :-)  This proposal was made to Demi Moore and her marriage ended in a
disaster... ;-)

My point is: if I need to be 100% sure that what is referenced on the database
is accessible all the time when the reference is, then I need to have this on
the database; if it can be missed sometimes then using the filesystem is
acceptable.

I'd include on the "100%" category fiscal records, client's documents, company
projects, etc.  On the second the drawing the user made of his grandmother,
the picture of the pretty neighbor in the shower, all reports generated
automatically from data existing on the database itself, etc.


I have systems with both solutions and the ones that give me more trouble to
manage are the ones that use external files.  If nobody has access to the
server, then it is a piece of cake.  But if there are other people touching
things -- moving / creating / deleting / changing files and directories --
then things get more complicated to manage.  Inside the DB they usually don't
dare to touch things because they know that they might destroy very important
data.  It is something that is inherent to people that everything the board of
directors use might cost their job.  Even if it is just the coffee
plate... :-)


Anyway, do you need my account number? ;-)  You can leave the transfer charges
to me, after all friends do that ;-)


And I am not smarter than anyone that listen to and learn with other people.
I am still learning here since I'm really a newbie in this database world.
Thanks for that, though :-)

--
Jorge Godoy      <jgodoy@gmail.com>

Re: Database versus filesystem for storing images

From
"imageguy"
Date:
Clodoaldo wrote:
> 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>:
> >
> > I think I know the answer,
>
> If you know the answer please tell it as I have read some discussions
> on the web and although I have decided on a solution I'm still not
> sure about the best answer, if there is a best answer after all.

Sorry, didn't mean to sound like and expert on this, I am actually
quite a newbie.  From all of the discussions I have read and even the
ones in this thread, including your own comments below, it would seem
that to store the files in the files system you need some sort of
application erver or :middleware - like a webserver - to handle the
retreiving and serving of the files.
My organization is developing a commercial application for "document
tracking". It is not a Browser application, but rather a more
traditional "windows" thick client app.

At the present time we do not intend to deploy any sort of "application
server" - web server, ftp server, and not all of the workstations will
have access to a consistent network share.

So in this case, it is my understanding that our only real choice is to
store the documents and images in the database itself.

... unless someone knows of a postgresql function that would allow us
to "server" the file from the filesystem via the dbserver ??


>
> > but if you don't have an "application
> > server" - ie a webserver, etc,
>
> Yes I have an application server, the Apache server.
>
> > and many of the workstations/clients
> > that need access to the images but may not have access to a network
> > share,
>
> network share? I don't understand. The images will be loaded by html
> pages with the img tag like in <img
> src="http://domain.com/images/xxx.jpg">
>
> > isn't the database the only choice ?
>
> No. It is one of the choices. The other is to store the images in the
> file system, in a directory readable by Apache.

See above.  WE are trying to reduce the dependancies on other
applications to ensure a simple deployment of the application.


Re: Database versus filesystem for storing images

From
Clodoaldo
Date:
2007/1/5, Jorge Godoy <jgodoy@gmail.com>:
> Andrew Chernow <pg-job@esilo.com> writes:
> > meet those requirements.  It is far more effecient to have apache access
> > them
>
> Where weren't we meeting his/her requirements?  All the discussion is around
> available means to do that.  One option is having the files on the database,
> the other is on the filesystem.  From my understanding we're discussing the
> benefits of each one.  Aren't we?

Yes, although I suggested two solutions I asked for anything that
would be considered the best practice. Now I think there is not a best
practice or better, there should be one best practice for each of the
solutions.

I have done an intranet application that stored images in the
database. It worked perfectly and I used the same engine in another
intranet application to store not only images but any document which
also worked perfectly.  The decision to go the dabatase only route was
easy: The filesystem space would have to be negotiated while the space
occupied by the databases were not controlled and used an advanced
storage solution that gave lots of terabytes to be used at will. Also
the any document application should not loose a single document and
access control should be strictly enforced which was much easier to do
with the database since I had no control over the webserver and even
if I had I think the database access is still easier to control than
the filesystem access. That was in a corporate intranet.

What I'm doing now is an internet application. While the FS x DB
synchronicity is very important in some kinds of document management,
it is not in this application. Indeed if a few images are lost each
day it has no meaning in a 500K to 1M inventory. The offended clients
just upload them again. No one will be sued. The images are all
public. No need to control the access.

But the main factor to push me in the file system direction is the
HTTP cache management. I want the internet web clients and proxies to
cache the images. The Apache web server has it ready and easy. If the
images where to be stored in the DB I would have to handle the HTTP
cache headers myself. Another code layer. Not too big a deal, but if
Apache give me it for free...

I wonder why this HTTP cache headers argument didn't surface in this
heated debate. Aren't DB developers/admins aware of the internet
client's bandwidth limitations? Or they just assume the application
would handle the HTTP cache headers? In the applications I created for
intranet bandwidth was almost a non issue and I didn't care to make
them bandwidth efficient, but for the internet the problem is there
and it is big.

Regards,
--
Clodoaldo Pinto Neto

Re: Database versus filesystem for storing images

From
Clodoaldo
Date:
2007/1/6, Andrew Chernow <andrew@esilo.com>:
>  >>>apache has very good page and image caching.  You could take advantage
>  >>>of that using this technique.
>
>  > I wonder why this HTTP cache headers argument didn't surface in this
>  > heated debate.
>
> I did other up this argument by the way.

Sorry, I understood you were talking about server side caching while
what I refer to is client side caching.

Regards,
--
Clodoaldo Pinto Neto

Re: Database versus filesystem for storing images

From
"Merlin Moncure"
Date:
On 1/6/07, Jorge Godoy <jgodoy@gmail.com> wrote:
> Andrew Chernow <pg-job@esilo.com> writes:
>
> >>I mean, how do you handle integrity with data
> >> outside the database?
> > You don't, the file system handles integrity of the stored data.  Although,
> > one must careful to avoid db and fs orphans.  Meaning, a record with no
> > corresponding file or a file with no corresponging record.  Always
> > write()/insert an image file to the system within a transaction, including
> > writing the image out to the fs.  Make sure to unlink any paritally written
> > image files.
>
> And how do you guarantee that after a failure?  You're restoring two
> different sets of data here:
>
>           - backup from your database
>           - backup from your files

you have a point -- keeping two sets of data in sync is more difficult
than working of a single interface (you have to implement your own
referential integrity of sorts),  but that has to be balanced against
the fact that postgresql is not great (yet) at storing and retrieving
huge numbers of large binary objects.  then again, neither are most
filesystems in my opinion.  also, and this is coming from a sql junky,
sql is not really that good at handling binary data, you have to go
right to binary prepared statements before things become even remotely
reasonable.

taking your side for a moment, the backup argument (against databases)
is a little bit disingenuous because while dumping a 10 terabyte
database is a pain, backing up a 10 terabyte filesystem is no picnic
either, rsync will get seizures...you have to implement some smart
logic that takes advantage of the fact that the information rarely if
ever changes.

andrew's posts hint at a way to do that that could be implemented
directly in a database or a filesystem...one would choose a filesystem
for this because the overhead is lower but there are other ways...

> How do you link them together on that specific operation?  Or even on a daily
> basis, if you get corrupted data...

if you take this approach, you have to layer a middleware over the
filesystem and use that always.  it's a programming challenge but it
can be done...

> >>>How do you plan your backup routine
> > In regards to backup, backup the files one-by-one.  Grab the lastest image
> > file refs from the database and start backing up those images.  Each
> > successfully backed up image should be followed by inserting that file's
> > database record into a remote db server.  If anything fails, cleanup the
> > partial image file (to avoid orphaned data) and rollout the transaction.
> >
> > just one idea.  i'm sure there are other ways of doing it.  point is, this is
> > completely possible to do reliably.
>
> Wouldn't replication with, e.g., Slony be easier?  And wouldn't letting the
> database handle all the integrity be easier?  I mean, create an "images" table
> and then make your record depends on this table, so if there's no record with
> the image, you won't have any references to it left.

I think if you were to replicate a really big database, for something
like this, a log based replication approach (pitr, or a modified
flavor of it) would be a better bet.  pg_dump should not even enter
into your vocabulary, unless you did some hackery like storing data in
progressive tables.

> It would also make the backup plan easier: backup the database.

backing up big *anythings* can be tough, pg_dump is not a scalable tool.

> Not counting that depending on your choice of filesystem and image size you
> might get a very poor performance.

performance of storing large blobs in the database is not great...if
you can get 50mb/sec writing to a filesystem, count yourself lucky if
you get 10mb/sec writing to the database, and this is only if you make
your client code very efficient.

I think at some point in the future postgresql might be a useful tool
in the way you are describing.  currently, there are some performance
issues, more flexibility in backup options would be required, and
maybe some other things.

merlin

Re: Database versus filesystem for storing images

From
"Jeremy Haile"
Date:
> I wonder why this HTTP cache headers argument didn't surface in this
> heated debate.

I mentioned this earlier as well.  Although you could do it in the app
layer - it would be easier to just let the web server handle it.

Re: Database versus filesystem for storing images

From
John McCawley
Date:
Is there any overwhelming reason you can't just stick an apache server
on your DB server?  Unless you expect this thing to get hit *hard*, the
performance of having them both on the same machine is pretty acceptable
(I know, everyone's opinion about what constitutes heavy usage
differs).  If this is a simple intranet application with around 100
users, the performance should be OK.

If I had to write a client-side app to pull images off of a server, I'd
much rather code HTTP GETs into my app than deal with binary data from
the database.  I've done both, and HTTP GETs are much easier.  Bear in
mind I haven't pulled binaries from a database in about 5 years, so
things might be easier now.

imageguy wrote:

>Clodoaldo wrote:
>
>
>>5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>:
>>
>>
>>>I think I know the answer,
>>>
>>>
>>If you know the answer please tell it as I have read some discussions
>>on the web and although I have decided on a solution I'm still not
>>sure about the best answer, if there is a best answer after all.
>>
>>
>
>Sorry, didn't mean to sound like and expert on this, I am actually
>quite a newbie.  From all of the discussions I have read and even the
>ones in this thread, including your own comments below, it would seem
>that to store the files in the files system you need some sort of
>application erver or :middleware - like a webserver - to handle the
>retreiving and serving of the files.
>My organization is developing a commercial application for "document
>tracking". It is not a Browser application, but rather a more
>traditional "windows" thick client app.
>
>At the present time we do not intend to deploy any sort of "application
>server" - web server, ftp server, and not all of the workstations will
>have access to a consistent network share.
>
>So in this case, it is my understanding that our only real choice is to
>store the documents and images in the database itself.
>
>... unless someone knows of a postgresql function that would allow us
>to "server" the file from the filesystem via the dbserver ??
>
>
>
>
>>>but if you don't have an "application
>>>server" - ie a webserver, etc,
>>>
>>>
>>Yes I have an application server, the Apache server.
>>
>>
>>
>>>and many of the workstations/clients
>>>that need access to the images but may not have access to a network
>>>share,
>>>
>>>
>>network share? I don't understand. The images will be loaded by html
>>pages with the img tag like in <img
>>src="http://domain.com/images/xxx.jpg">
>>
>>
>>
>>>isn't the database the only choice ?
>>>
>>>
>>No. It is one of the choices. The other is to store the images in the
>>file system, in a directory readable by Apache.
>>
>>
>
>See above.  WE are trying to reduce the dependancies on other
>applications to ensure a simple deployment of the application.
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 4: Have you searched our list archives?
>
>               http://archives.postgresql.org/
>
>

Re: Database versus filesystem for storing images

From
"Dawid Kuroczko"
Date:
On 1/5/07, Jorge Godoy <jgodoy@gmail.com> wrote:
> Andrew Chernow <pg-job@esilo.com> writes:
>
> >> And how do you guarantee that after a failure?  You're restoring two
> >> different sets of data here:
> >
> >> How do you link them together on that specific operation?  Or even on a daily
> >> basis, if you get corrupted data...
> >
> > I answered that already.
>
> I'm sorry.  It must be the flu, the pain or something else, but I really don't
> remember reading your message about how you can be 100% sure that all
> references to the filesystem have their corresponding files present and also
> all present files have their respective database entry.

By designing the system so that you can be sure.  For instance delegate
removing data from filesystem to a dedicated queue table within database,
and carefully check that code.  Let no other software delete data.

If you need PITR, you can stop the remover utility during backups (think of
it as a VACUUM for filesystem ;)).

> I've seen HA measures (I don't imagine anyone sacrificing their customers
> copying 3 billion files and a few hundred terabytes while still maintaining an
> adequate service rate with part of its infra-structure down, just to use your
> example to that answer...), ideas about requiring an answer from the
> filesystem before considering the transaction done DB-wise (who grants you
> that the image really went to the disk and is not on cache when the machine
> has a power failure and shuts down abruptly?)...

And who grants you that the WAL log file really went to the disk and is not on
cache when the machine has a power failure and shuts down abruptly?

Use a trustworthy hardware.  You would have to do it anyway, if you wanted
to go with "all-in-DB" approach.

> > Some people have seen this as a disadvantage on this thread, I personally
> > don't see it that why.
>
> I am questioning two points that show two situations where it is bad.
> Specially if those images are important to the records (e.g. product failure
> images, prize winning images, product specs, prototype images, blueprints --
> after all, we don't need to restrict our files to images, right? --,
> agreements, spreadsheets with the last years of company account movements,
> documents received from lawyers, etc.).

I don't think noone is saying that storing images in DB isn't better from
data integrity point of view.  But it has drawbacks, which sometimes make
pepople store their images elsewhere in real life.

Of course if one had infinite budget...  But other than that, if you are storing
lots of data, and you can afford a trade-off between data safety (that you
can loose some data or that your data can be bloated by dead data) and
costs, then you have to consider storing data otherwise.

> > I guess it depends on access needs, many files and how much data you have.
> > What if you had 3 billion files across a few hundred terabytes?  Can you say
> > with experience how the database would hold up in this situation?
>
> I'd have partitioning if I had a case like that.  Part of those would be
> delegated to one machine, part to another and so on.  Even if that solution --
> partitioning -- makes the overall MTBF lower...

And how do you handle data integrity between many machines? The answer
is of 2PC, I guess.  But still, managing integrity between many machines is
also hard -- remember, you don't have RI constraints for remote data.

> And I still can't imagine how you guarantee that all 3 billion files have
> their corresponding entries on the database.  Couting them is not enough since
> I can have one file with the wrong "name" present on the filesystem or some
> duplicate record on the DB...

Depends what you mean by integrity.

For my needs it is sufficient that I can guarantee that every object mentioned
in a database is present on the filesystem.  Orphaned files are not a problem
and can be removed (if there really is need to) by a monthly cron job ("find
old files not mentioned in DB", "find not accessed old files and check if they
are in DB", etc. etc.).

More important still, is that I can make a snapshot of data.  With filesystem
it would be just:
  stop remover utility
  pg_dump  &  tar (or whatever) the files
  start remover utility

With DB, one would have to start pg_dump and wait until it finishes.
And observe
how your tables start to bloat as the data needs to be preserved for
data integrity
reasons.  How your database is more and mroe filled with dead tuples.  And how,
after it finishes VACUUM struggles to clean up the tables.  The more data you,
have, the more frequently it changes, the bigger problem it is.
Filesystem is simply
more efficient at storing data (the non-transactionness, and limited
metadata being
the tradeoff).

I don't say "all in DB" is fundamentally wrong.  It's just that its
niches are "low
bandwidth services", like intranets or rarely visited data, some forms of "very
important data store" services, where one must be absolutely sure about
safety and integrity, and finally, let's call it "academic/hobbyst research" ;-)

files outside of DB are where one must squeeze performance out of hardware
and it comes at a price.  If you can afford the price, you get the prize. ;)

   Regards,
        Dawid

Re: Database versus filesystem for storing images

From
Clodoaldo
Date:
2007/1/6, Maurice Aubrey <maurice.aubrey@gmail.com>:
> Clodoaldo wrote:
> > But the main factor to push me in the file system direction is the
> > HTTP cache management. I want the internet web clients and proxies to
> >  cache the images. The Apache web server has it ready and easy. If
> > the images where to be stored in the DB I would have to handle the
> > HTTP cache headers myself. Another code layer. Not too big a deal,
> > but if Apache give me it for free...
>
> There's a hybrid approach which has worked well for us.
>
> You store the binary data in the database along with a signature.
>
> On the Apache side, you write a 404 handler that, based on the request,
> fetches the binary from the database and writes it locally to the
> filesystem based on the signature (using a multi-level hashing scheme
> possibly as detailed in previous posts).
>
> When a request comes in to Apache, if the file exists it is served
> directly without any db interaction. OTOH, if it's missing, your 404
> handler kicks in to build it and you get a single trip to the db.
>
> You get the benefits of keeping the data in the db (transaction
> semantics, etc.) but also get the scalability and caching benefits
> of having the front-end webservers handle delivery.
>
> If you lose the locally cached data it's not an issue. They'll be
> faulted back into existence on demand.
>
> With multiple webservers, you can just allow the data to be cached on
> each machine, or if there's too much data for that, have your load
> balancer divide the requests to different webserver pools based on the
> signature.
>
> As an extension, if you need different versions of the data (like
> different sizes of an image, etc.), you can modify your URLs to indicate
> the version wanted and have the 404 handler take that into account when
> building them. You only store the original content in the database but
> could have any number of transformed versions on the webservers. Again,
> losing those versions is not an issue and do not require backup.

Very interesting approach. And I think it is also original as I have
not seen any mention of it. Thanks for sharing it.

--
Clodoaldo Pinto Neto

Re: Database versus filesystem for storing images

From
Andrew Chernow
Date:
 >>>apache has very good page and image caching.  You could take advantage
 >>>of that using this technique.

 > I wonder why this HTTP cache headers argument didn't surface in this
 > heated debate.

I did other up this argument by the way.

Andrew


Clodoaldo wrote:
> 2007/1/5, Jorge Godoy <jgodoy@gmail.com>:
>> Andrew Chernow <pg-job@esilo.com> writes:
>> > meet those requirements.  It is far more effecient to have apache
>> access
>> > them
>>
>> Where weren't we meeting his/her requirements?  All the discussion is
>> around
>> available means to do that.  One option is having the files on the
>> database,
>> the other is on the filesystem.  From my understanding we're
>> discussing the
>> benefits of each one.  Aren't we?
>
> Yes, although I suggested two solutions I asked for anything that
> would be considered the best practice. Now I think there is not a best
> practice or better, there should be one best practice for each of the
> solutions.
>
> I have done an intranet application that stored images in the
> database. It worked perfectly and I used the same engine in another
> intranet application to store not only images but any document which
> also worked perfectly.  The decision to go the dabatase only route was
> easy: The filesystem space would have to be negotiated while the space
> occupied by the databases were not controlled and used an advanced
> storage solution that gave lots of terabytes to be used at will. Also
> the any document application should not loose a single document and
> access control should be strictly enforced which was much easier to do
> with the database since I had no control over the webserver and even
> if I had I think the database access is still easier to control than
> the filesystem access. That was in a corporate intranet.
>
> What I'm doing now is an internet application. While the FS x DB
> synchronicity is very important in some kinds of document management,
> it is not in this application. Indeed if a few images are lost each
> day it has no meaning in a 500K to 1M inventory. The offended clients
> just upload them again. No one will be sued. The images are all
> public. No need to control the access.
>
> But the main factor to push me in the file system direction is the
> HTTP cache management. I want the internet web clients and proxies to
> cache the images. The Apache web server has it ready and easy. If the
> images where to be stored in the DB I would have to handle the HTTP
> cache headers myself. Another code layer. Not too big a deal, but if
> Apache give me it for free...
>
> I wonder why this HTTP cache headers argument didn't surface in this
> heated debate. Aren't DB developers/admins aware of the internet
> client's bandwidth limitations? Or they just assume the application
> would handle the HTTP cache headers? In the applications I created for
> intranet bandwidth was almost a non issue and I didn't care to make
> them bandwidth efficient, but for the internet the problem is there
> and it is big.
>
> Regards,

Re: Database versus filesystem for storing images

From
Maurice Aubrey
Date:
Clodoaldo wrote:
> But the main factor to push me in the file system direction is the
> HTTP cache management. I want the internet web clients and proxies to
>  cache the images. The Apache web server has it ready and easy. If
> the images where to be stored in the DB I would have to handle the
> HTTP cache headers myself. Another code layer. Not too big a deal,
> but if Apache give me it for free...

There's a hybrid approach which has worked well for us.

You store the binary data in the database along with a signature.

On the Apache side, you write a 404 handler that, based on the request,
fetches the binary from the database and writes it locally to the
filesystem based on the signature (using a multi-level hashing scheme
possibly as detailed in previous posts).

When a request comes in to Apache, if the file exists it is served
directly without any db interaction. OTOH, if it's missing, your 404
handler kicks in to build it and you get a single trip to the db.

You get the benefits of keeping the data in the db (transaction
semantics, etc.) but also get the scalability and caching benefits
of having the front-end webservers handle delivery.

If you lose the locally cached data it's not an issue. They'll be
faulted back into existence on demand.

With multiple webservers, you can just allow the data to be cached on
each machine, or if there's too much data for that, have your load
balancer divide the requests to different webserver pools based on the
signature.

As an extension, if you need different versions of the data (like
different sizes of an image, etc.), you can modify your URLs to indicate
the version wanted and have the 404 handler take that into account when
building them. You only store the original content in the database but
could have any number of transformed versions on the webservers. Again,
losing those versions is not an issue and do not require backup.

Maurice

Re: Database versus filesystem for storing images

From
"imageguy"
Date:
Clodoaldo wrote:
> 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>:
> >
> > I think I know the answer,
>
> If you know the answer please tell it as I have read some discussions
> on the web and although I have decided on a solution I'm still not
> sure about the best answer, if there is a best answer after all.

Sorry, didn't mean to sound like and expert on this, I am actually
quite a newbie.  From all of the discussions I have read and even the
ones in this thread, including your own comments below, it would seem
that to store the files in the files system you need some sort of
application erver or :middleware - like a webserver - to handle the
retreiving and serving of the files.
My organization is developing a commercial application for "document
tracking". It is not a Browser application, but rather a more
traditional "windows" thick client app.

At the present time we do not intend to deploy any sort of "application
server" - web server, ftp server, and not all of the workstations will
have access to a consistent network share.

So in this case, it is my understanding that our only real choice is to
store the documents and images in the database itself.

... unless someone knows of a postgresql function that would allow us
to "server" the file from the filesystem via the dbserver ??


>
> > but if you don't have an "application
> > server" - ie a webserver, etc,
>
> Yes I have an application server, the Apache server.
>
> > and many of the workstations/clients
> > that need access to the images but may not have access to a network
> > share,
>
> network share? I don't understand. The images will be loaded by html
> pages with the img tag like in <img
> src="http://domain.com/images/xxx.jpg">
>
> > isn't the database the only choice ?
>
> No. It is one of the choices. The other is to store the images in the
> file system, in a directory readable by Apache.

See above.  WE are trying to reduce the dependancies on other
applications to ensure a simple deployment of the application.


Re: Database versus filesystem for storing images

From
John McCawley
Date:
If you end up storing the pictures on a separate database server for
performance reasons, Jorge's argument regarding integrity becomes
moot...And his argument so far is the strongest case I've seen for
storing the files in the database.

Scott Marlowe wrote:

>On Fri, 2007-01-05 at 15:54, Jeremy Haile wrote:
>
>
>>Yeah - it can make it easier to implement transactional semantics by
>>storing them in the database, although for simple operations it wouldn't
>>be hard to replicate this manually.  And you are going to incur a
>>performance penalty by storing them in the database.
>>
>>Another thing to consider is that storing them in the file system makes
>>it much easier to browse the images using third-party tools, update
>>them, archive them (by gzipping or whatever).  This is much more
>>difficult if they are stored in the database.
>>
>>
>
>The biggest performance penalty from them being in the same database as
>your data is that they're going to be the majority of your kernel /
>database buffers.  So, as mentioned earlier, it's almost a necessity
>(for performance reasons) to put them into their own db server.
>
>---------------------------(end of broadcast)---------------------------
>TIP 5: don't forget to increase your free space map settings
>
>

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Re: Database versus filesystem for storing images

From
Maurice Aubrey
Date:
Clodoaldo wrote:
> But the main factor to push me in the file system direction is the
> HTTP cache management. I want the internet web clients and proxies to
>  cache the images. The Apache web server has it ready and easy. If
> the images where to be stored in the DB I would have to handle the
> HTTP cache headers myself. Another code layer. Not too big a deal,
> but if Apache give me it for free...

There's a hybrid approach which has worked well for us.

You store the binary data in the database along with a signature.

On the Apache side, you write a 404 handler that, based on the request,
fetches the binary from the database and writes it locally to the
filesystem based on the signature (using a multi-level hashing scheme
possibly as detailed in previous posts).

When a request comes in to Apache, if the file exists it is served
directly without any db interaction. OTOH, if it's missing, your 404
handler kicks in to build it and you get a single trip to the db.

You get the benefits of keeping the data in the db (transaction
semantics, etc.) but also get the scalability and caching benefits
of having the front-end webservers handle delivery.

If you lose the locally cached data it's not an issue. They'll be
faulted back into existence on demand.

With multiple webservers, you can just allow the data to be cached on
each machine, or if there's too much data for that, have your load
balancer divide the requests to different webserver pools based on the
signature.

As an extension, if you need different versions of the data (like
different sizes of an image, etc.), you can modify your URLs to indicate
the version wanted and have the 404 handler take that into account when
building them. You only store the original content in the database but
could have any number of transformed versions on the webservers. Again,
losing those versions is not an issue and do not require backup.

Maurice


Re: Database versus filesystem for storing images

From
Scott Ribe
Date:
> My point is: if I need to be 100% sure that what is referenced on the database
> is accessible all the time when the reference is, then I need to have this on
> the database

Not necessarily. It does take carefully controlling access, with a good deal
of thought and error-checking on the part of the code that has write access
to the files, but it can certainly be done.

> But if there are other people touching
> things -- moving / creating / deleting / changing files and directories --
> then things get more complicated to manage.

Absolutely. But allowing the kinds of tools mentioned earlier for examining
files does not require giving anyone write access ;-)


--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice