Best Strategy for Large Number of Images

From
Estevan Rech
Date:
Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a question about the best way/strategy for storing images. I have about 2 million images with a growth trend of around 1 million images per year. I plan to store in bytea format in an isolated table. Is this recommended? Is there another better way?

Re: Best Strategy for Large Number of Images

From
Rob Sargent
Date:
On 12/15/21 10:45, Estevan Rech wrote:
Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a question about the best way/strategy for storing images. I have about 2 million images with a growth trend of around 1 million images per year. I plan to store in bytea format in an isolated table. Is this recommended? Is there another better way?

How big are the images on average?

Re: Best Strategy for Large Number of Images

From
Estevan Rech
Date:
350 KB

Em qua., 15 de dez. de 2021 às 14:50, Rob Sargent <robjsargent@gmail.com> escreveu:
On 12/15/21 10:45, Estevan Rech wrote:
Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a question about the best way/strategy for storing images. I have about 2 million images with a growth trend of around 1 million images per year. I plan to store in bytea format in an isolated table. Is this recommended? Is there another better way?

How big are the images on average?

Re: Best Strategy for Large Number of Images

From
"David G. Johnston"
Date:

On Wed, Dec 15, 2021 at 10:57 AM Estevan Rech <softrech@gmail.com> wrote:
350 KB

Em qua., 15 de dez. de 2021 às 14:50, Rob Sargent <robjsargent@gmail.com> escreveu:
On 12/15/21 10:45, Estevan Rech wrote:
Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a question about the best way/strategy for storing images. I have about 2 million images with a growth trend of around 1 million images per year. I plan to store in bytea format in an isolated table. Is this recommended? Is there another better way?

How big are the images on average?

Please don't top-post - inline or bottom posting (with trimming) is the custom here.

It does depend (both on the size and usage patterns).

An alternative way is to simply not store images within the database itself.  Store a pointer, and then store the image externally.  The application then gets the pointer from the database and uses it to retrieve the image from the image server.
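The pointer approach can be sketched in a few lines. This is a hypothetical illustration, not code from the thread: the image bytes go to the filesystem (or an image server), content-addressed by hash, and only the relative path is stored in the database.

```python
import hashlib
from pathlib import Path

# Hypothetical sketch of the pointer approach: image bytes live on the
# filesystem / image server; the database row keeps only a relative path.
# All names here are illustrative.
def store_image(root: Path, data: bytes) -> str:
    """Write the image under root and return the pointer to store in the DB."""
    digest = hashlib.sha256(data).hexdigest()
    # Fan out into subdirectories so no single directory grows too large.
    rel_path = f"{digest[:2]}/{digest[2:4]}/{digest}.jpg"
    target = root / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return rel_path  # e.g. INSERT INTO images (item_id, path) VALUES (%s, %s)
```

A side effect of content addressing is deduplication: storing the same bytes twice yields the same pointer.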

David J.

Re: Best Strategy for Large Number of Images

From
Estevan Rech
Date:
I'm thinking of keeping them inside the database to make selects and report generation easier...

Re: Best Strategy for Large Number of Images

From
Adrian Klaver
Date:
On 12/15/21 10:17 AM, Estevan Rech wrote:
> I think about using it inside the database to facilitate the select and 
> generation of reports...
> 

As long as the meta data about the images is stored in the database that 
is still possible.


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Best Strategy for Large Number of Images

From
Estevan Rech
Date:
But can PostgreSQL handle it? Does it have good search performance? And is memory usage feasible?

Re: Best Strategy for Large Number of Images

From
Adrian Klaver
Date:
On 12/15/21 10:48 AM, Estevan Rech wrote:
> But can PostgreSQL handle it? Does it have good search performance? And 
> is memory usage feasible?

Search on what?

-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Best Strategy for Large Number of Images

From
"David G. Johnston"
Date:
On Wednesday, December 15, 2021, Estevan Rech <softrech@gmail.com> wrote:
But can PostgreSQL handle it? Does it have good search performance? And is memory usage feasible?

Images of that size won’t be stored in the main table, so performance when not asking for image data should be normal.  Memory is one of those usage-pattern and hardware “it depends” things.  Probably it will be OK.

Personally, I think it is easier, and less complex, than the alternative.  Do it, and migrate later if issues arise.

David J.

Re: Best Strategy for Large Number of Images

From
Estevan Rech
Date:
The possibilities are known, but does anyone have experience with this scenario?

Re: Best Strategy for Large Number of Images

From
Ron
Date:
On 12/15/21 11:45 AM, Estevan Rech wrote:
Hi,

I don't currently use PostgreSQL, but I plan to migrate and I have a question about the best way/strategy for storing images. I have about 2 million images with a growth trend of around 1 million images per year. I plan to store in bytea format in an isolated table. Is this recommended? Is there another better way?

We have an application which stores images in the database using bytea fields.  The next version of the software will store the images externally, with the filename in the database, because filesystems are just inherently better at storing lots of files than a database is.

--
Angular momentum makes the world go 'round.

Re: Best Strategy for Large Number of Images

From
Rob Sargent
Date:
On 12/15/21 12:22, Estevan Rech wrote:
The possibilities are known, but does anyone have experience with this scenario?
My previous employer dealt with radiology images.  These were in the file system with an image server as suggested up thread.

Re: Best Strategy for Large Number of Images

From
Adrian Klaver
Date:
On 12/15/21 11:22, Estevan Rech wrote:
> The possibilities are known, but does anyone have experience with this 
> scenario?

It would help if you elaborated on what the scenario is.

There is no question that you can store images in the database. The 
issues are what you want to do with them once they are there, the 
hardware (real or virtual) you have available, the user load on the 
database, etc.

Providing at least an outline that answers the above would go a long way 
to getting some reasonable answers based on experience.

-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Best Strategy for Large Number of Images

From
Estevan Rech
Date:
Adrian, 

I have an application that takes pictures of items and uses them as evidence in a report.

These images are saved and then used to generate a report.

Each item averages 10 photos and I have about 2 million photos currently, with an average growth of 1 million photos over the next year.

I think about using it within the database because of the ease of selecting records with the content of the images (when necessary).

I think my biggest concern is that I'll continue with this strategy for the next 2 years and then hit a PostgreSQL limitation or some failure, and have to migrate the database again or suffer a significant loss of images. And the contrary too: if I keep them on disk, in 2 years I might run into some failure or file limit in the operating system (currently it's Windows Server 2016)...

Related to this is the backup service: it seems to me that in a disaster I can restore the database relatively quickly if the images are in the database. From disk, I believe the restoration is much slower, and I don't trust that all the images would be copied/restored.

Re: Best Strategy for Large Number of Images

From
Gavan Schneider
Date:
On 16 Dec 2021, at 7:12, Estevan Rech wrote:

> I have an application that takes pictures of items and uses them as
> evidence in a report.
>
> These images are saved and then used to generate a report.
>
Once you start talking reports and evidence I think audit trail and verification, a.k.a. “chain of evidence”.
With the images in the database they can be part of your information ecosystem, getting a datestamp along with a checksum,
and all this included in the audit trail.
All this can be done with the images in the file system and the database holding only the pointers… it just may not be as
easy to convince the opposition’s lawyers that it is tamper proof. Except the best you can really get is tamper evident.
😉
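The datestamp-plus-checksum idea above can be sketched as follows; the field names and functions here are hypothetical, purely illustrative of how the metadata row stored alongside the image enables later tamper-evidence checks:

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical sketch of a tamper-evident audit record for one image;
# the dict keys are illustrative, not an established schema.
def audit_record(image_bytes: bytes) -> dict:
    """Build the metadata to store alongside the image for the audit trail."""
    return {
        "sha256": hashlib.sha256(image_bytes).hexdigest(),
        "stored_at": datetime.now(timezone.utc).isoformat(),
        "size": len(image_bytes),
    }

def verify(image_bytes: bytes, record: dict) -> bool:
    """Later, prove the stored bytes still match the recorded checksum."""
    return hashlib.sha256(image_bytes).hexdigest() == record["sha256"]
```

This works the same whether the bytes themselves live in a bytea column or on the filesystem; only the checksum and timestamp need to be in the database.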

> Each item averages 10 photos and I have about 2 million photos currently,
> with an average growth of 1 million photos over the next year.
>
> I think about using it within the database because of the ease of selecting
> records with the content of the images (when necessary).
>
That will be the same however you go. The selection will be on the context data you enter with the images. There has
been no mention of searching within the image itself, and this is not something PG can do “out of the box” anyway. 

> I think my biggest concern is to continue to use this strategy for the next
> 2 years and encounter a PostgreSQL limitation or some failure and have to
> migrate database again or have significant loss of images. And on the
> contrary too, if I use it on disk and in 2 years I find some failure or
> file limit in the operating system (currently it's windows server 2016)...
>
The total image data will end up somewhere under 4 TB if your estimates continue to hold. That’s not going to stress
PG. It’s not such a big deal on a modern home computer either, but I can’t make any informed comment on Windows
anything. Realistically, once the data is in the database the whole thing can be moved to a more capable machine and/or OS if
problems start to appear. PG will work wherever you deploy it. 

> Related to this is the backup service that seems to me that in a disaster,
> I can restore the database relatively quickly if it's in the database. On
> the disk, I believe that the restoration is much slower and I don't think
> it's reliable that all the images are copied/restored.
>
If you have a reliable, tested backup system for the database and don’t want to add to the workload then get the images
into the database. If there are reasons why you end up with the images outside the database then, of course, the backup
process has to be able to keep everything together and guarantee a full restore. Doable as well, but it does add some
complexity.

If I am reading this correctly the use case is one of receiving data in the form of photos and field reports, entering
the report items associated with the photos into the information system, generating a report (with embedded photos), and
keeping records so disputes can be handled. This is not high-volume access to the images, so there is no real need for
optimum filesystem speed to serve the images… keeping them in the database as bytea is perfectly workable and will work
for data volumes well beyond the current estimates. 

Gavan Schneider
——
Gavan Schneider, Sodwalls, NSW, Australia
Explanations exist; they have existed for all time; there is always a well-known solution to every human problem —
neat, plausible, and wrong. 
— H. L. Mencken, 1920



Re: Best Strategy for Large Number of Images

From
Marc Millas
Date:
Hi,
when I am doing this, I store the data in one Postgres cluster, with some kind of id for each image, and the images in another cluster, with the id as the link.
The customer app is written so that it issues a dedicated HTTP request for each image (I use nginx to create a dedicated "path").
pro: the images don't overload the shared buffer cache of the data db; it's easier to cache the images if that's relevant performance-wise; backups are much easier on the data db as it's much smaller; it's very easy to scale horizontally by adding dedicated image dbs; etc...
con: as with any two-store solution, even the disk option, you "may" encounter sync problems, i.e. an id existing in one db and not in the other. But this depends on how you maintain the images.
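The "id in one db and not in the other" risk can be caught with a periodic reconciliation pass. A minimal sketch, assuming the two id lists come from hypothetical queries against each cluster (e.g. SELECT id FROM items and SELECT id FROM images):

```python
# Hypothetical reconciliation check between the data cluster and the
# images cluster: compare the two id sets and report drift in both
# directions. The id lists are stand-ins for real query results.
def find_drift(data_ids, image_ids):
    data_ids, image_ids = set(data_ids), set(image_ids)
    return {
        "missing_images": sorted(data_ids - image_ids),  # rows without bytes
        "orphan_images": sorted(image_ids - data_ids),   # bytes without rows
    }
```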

my 2 cents


Marc MILLAS
Senior Architect
+33607850334



On Wed, Dec 15, 2021 at 9:12 PM Estevan Rech <softrech@gmail.com> wrote:
Adrian, 

I have an application that takes pictures of items and uses them as evidence in a report.

These images are saved and then used to generate a report.

Each item averages 10 photos and I have about 2 million photos currently, with an average growth of 1 million photos over the next year.

I think about using it within the database because of the ease of selecting records with the content of the images (when necessary).

I think my biggest concern is to continue to use this strategy for the next 2 years and encounter a PostgreSQL limitation or some failure and have to migrate database again or have significant loss of images. And on the contrary too, if I use it on disk and in 2 years I find some failure or file limit in the operating system (currently it's windows server 2016)...

Related to this is the backup service that seems to me that in a disaster, I can restore the database relatively quickly if it's in the database. On the disk, I believe that the restoration is much slower and I don't think it's reliable that all the images are copied/restored.

Re: Best Strategy for Large Number of Images

From
Bruce Momjian
Date:
On Wed, Dec 15, 2021 at 02:45:15PM -0300, Estevan Rech wrote:
> Hi,
> 
> I don't currently use PostgreSQL, but I plan to migrate and I have a question
> about the best way/strategy for storing images. I have about 2 million images
> with a growth trend of around 1 million images per year. I plan to store in
> bytea format in an isolated table. Is this recommended? Is there another better
> way?

You might want to read this:

    https://momjian.us/main/blogs/pgblog/2020.html#June_19_2020

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Best Strategy for Large Number of Images

From
Benedict Holland
Date:
I would recommend storing a link to the file rather than the file itself. Other than that, use BLOBs. I would probably recommend not storing any binary objects in a database for a variety of reasons, but if you have to then bytea will work, assuming each value is smaller than 1 GB (the bytea limit). 

Thanks,
- Ben

On Wed, Dec 15, 2021, 4:41 PM Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Dec 15, 2021 at 02:45:15PM -0300, Estevan Rech wrote:
> Hi,
>
> I don't currently use PostgreSQL, but I plan to migrate and I have a question
> about the best way/strategy for storing images. I have about 2 million images
> with a growth trend of around 1 million images per year. I plan to store in
> bytea format in an isolated table. Is this recommended? Is there another better
> way?

You might want to read this:

        https://momjian.us/main/blogs/pgblog/2020.html#June_19_2020

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.



Re: Best Strategy for Large Number of Images

From
Bèrto ëd Sèra
Date:
> I don't currently use PostgreSQL, but I plan to migrate and I have a question 
> about the best way/strategy for storing images. I have about 2 million images
> with a growth trend of around 1 million images per year. I plan to store in bytea
> format in an isolated table. Is this recommended? Is there another better way?

Do you need to edit those images? Or do you need to make sure they are not edited, and if they are, you want to know who did it?

If so, I'd say make a versioned external repo (a git repo will do), and have your application store a chain of SHA-1s of the file commits. If anyone wants to alter your images behind your back, they'll need to gain access to both the DB and the repo, and know how you pair the two information levels (how deeply paranoid you need to be is up to your business requirements). 
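The hash-chain idea can be sketched without the git machinery. This is an illustrative simplification (SHA-256 here, where the repo itself would supply SHA-1 commit ids): each entry's hash covers the image digest plus the previous entry's hash, so altering any earlier image invalidates every hash after it.

```python
import hashlib

# Illustrative hash chain over a sequence of images: tampering with any
# image changes its entry and every subsequent one.
def chain_entry(prev_hash: str, image_bytes: bytes) -> str:
    image_digest = hashlib.sha256(image_bytes).hexdigest()
    return hashlib.sha256((prev_hash + image_digest).encode()).hexdigest()

def build_chain(images) -> list[str]:
    hashes, prev = [], "0" * 64  # arbitrary genesis value
    for img in images:
        prev = chain_entry(prev, img)
        hashes.append(prev)
    return hashes
```

To verify, rebuild the chain from the stored images and compare against the hashes recorded in the DB.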

If you just want to keep the images and no particular security is required, I'd say just store them in the DB. 1,000,000 pictures/year is ~2/minute, if the flow is regular. As per your later posts they are not so big that they could not be managed, you don't seem to expect a lot of concurrent reads, and doing it like this all you have to think about is ONE db backup/restore procedure. This might be more complex if you expect traffic peaks in the insert phase, of course.

Berto

Re: Best Strategy for Large Number of Images

From
"David G. Johnston"
Date:
On Wed, Dec 15, 2021 at 1:12 PM Estevan Rech <softrech@gmail.com> wrote:

Related to this is the backup service that seems to me that in a disaster, I can restore the database relatively quickly if it's in the database. On the disk, I believe that the restoration is much slower and I don't think it's reliable that all the images are copied/restored.

Everything ends up "on disk" at some point - whether it's a database or a simple file server - so speed and reliability would generally be a wash conceptually.

Honestly, with that more complete description (and the assumption these images are primary work products that are being paid for), I would say you should be doing both.  Use something like AWS S3 to store a copy - saving its path into the DB as metadata - while also leaving a copy within the database.  As mentioned nearby, also generate a hash and store that.  Institute a rule that once an image has been generated it is never altered.  If you need to deal with editing, create a new record, and maybe record the "source image hash" so you can build a chain if desired.

I'd probably store the image data on a separate table than the metadata - in a one-to-one relationship.  Maybe even partition the image file table.  That lets you drop images from the DB if desired - with the option to restore them from the AWS S3 system if needed.  You can over-engineer things this way but I do think that having two primary sources for the images, plus their backups, is reasonable for the types of images you are dealing with.

David J.

Re: Best Strategy for Large Number of Images

From
Дмитрий Иванов
Date:
PostgreSQL, thanks to extensions, allows a third, non-obvious file strategy. Many developers working with small images don't want to complicate their solutions with the details of interacting with file servers, and I didn't want to store files in the database, so I found and tried a third option: I turned PostgreSQL into a file server that exchanges pages of bytes with the client through bytea virtual tables. Roughly it looks like this: client-(access library, asynchronous page exchange)-PostgreSQL-Python-file system. I've set a limit of 2 GB, but working with files that large doesn't seem reasonable to me with this strategy. For storing fiscal documents (<10 MB), though, the system turned out pretty well.
--
Regards, Dmitry!


ср, 15 дек. 2021 г. в 23:18, Estevan Rech <softrech@gmail.com>:
I'm thinking of keeping them inside the database to make selects and report generation easier...

RE: Best Strategy for Large Number of Images

From
Zahir Lalani
Date:


We have a large B2B application used in marketing management, heavily focused on images/video. No binaries are ever stored in the DB: we keep all the metadata in the DB and the binaries in a scalable folder structure. Serving up images from the DB is not very scalable. I am not sure what you would do with the binary data that requires it to be in the DB?

 

Z

 

From: Estevan Rech <softrech@gmail.com>
Sent: 15 December 2021 17:45
To: pgsql-general@lists.postgresql.org
Subject: Best Strategy for Large Number of Images

 

Hi,


I don't currently use PostgreSQL, but I plan to migrate and I have a question about the best way/strategy for storing images. I have about 2 million images with a growth trend of around 1 million images per year. I plan to store in bytea format in an isolated table. Is this recommended? Is there another better way?

 

Re: Best Strategy for Large Number of Images

From
Estevan Rech
Date:
What does this folder structure look like, something like 10,000 folders? And how long does a backup of it take?

Re: Best Strategy for Large Number of Images

From
Andreas Joseph Krogh
Date:
På mandag 20. desember 2021 kl. 11:29:57, skrev Estevan Rech <softrech@gmail.com>:
What does this folder structure look like, something like 10,000 folders? And how long does a backup of it take?
 
I recommend using SeaweedFS as a blob store, and storing the metadata (folder, size, filenames etc.) in the DB. It has excellent HA and backup mechanisms.
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 

Re: Best Strategy for Large Number of Images

From
Imre Samu
Date:
> ... I have about 2 million images ... 
> folder structure 

The "Who's On First" gazetteer, with ~26M geojson records, uses a 3-number-chunk subfolder structure:
"Given a Who's On First ID its (relative) URI can be derived by splitting the ID in to 3-number chunks representing nested subdirectories, followed by filename consisting of the ID followed by .geojson.    For example the ID for Montréal is 101736545 which becomes:   101/736/545/101736545.geojson"
It is working, but it is also not optimal:
"As of this writing it remains clear that this approach (lots of tiny files parented by lots of nested directories) can be problematic. We may be forced to choose another approach, like fewer subdirectories but nothing has been decided and anything we do will be backwards compatible." (from https://whosonfirst.org/data/principles/)
The structure has since been migrated to per-country repositories (https://whosonfirst.org/blog/2019/05/09/changes/).

Maybe you can adopt some ideas. 
imho: with 3-number chunks representing nested subdirectories, you can choose among more file systems / hosting solutions. 
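The ID-to-path derivation quoted above is simple enough to sketch; a minimal Python version (how the real scheme handles IDs whose length is not a multiple of three is a guess here):

```python
# Sketch of the Who's On First scheme quoted above: split the numeric ID
# into 3-digit chunks that become nested subdirectories, then append the
# full ID as the filename.
def wof_relative_uri(wof_id: int) -> str:
    s = str(wof_id)
    chunks = [s[i:i + 3] for i in range(0, len(s), 3)]
    return "/".join(chunks) + f"/{s}.geojson"
```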

regards,
 Imre

Estevan Rech <softrech@gmail.com> ezt írta (időpont: 2021. dec. 20., H, 11:30):
What does this folder structure look like, something like 10,000 folders? And how long does a backup of it take?