Thread: Where to store Blobs?
Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.

Now I realized: Nobody talked about Blobs.

I guess most people do not store Blobs in PostgreSQL.

Where do you store Blobs?

(In my case Blobs are PDF/image files with size up to 20 MByte.
I do not talk about very big blobs which are several hundred MByte)

--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines
On 3/13/19 7:28 AM, Thomas Güttler wrote:
> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
>
> Now I realized: Nobody talked about Blobs.
>
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?

Probably the preferred method: https://www.postgresql.org/docs/11/datatype-binary.html

Another method: https://www.postgresql.org/docs/11/largeobjects.html

> (In my case Blobs are PDF/image files with size up to 20 MByte.
> I do not talk about very big blobs which are several hundred MByte)

--
Adrian Klaver adrian.klaver@aklaver.com
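[To make the bytea route above concrete, here is a minimal sketch in Python with psycopg2; the "documents" table, the connection string and the file name are illustrative assumptions, not anything prescribed in this thread.]

# Minimal sketch (assumed "documents" table and local connection string).
# bytea keeps the file inside a normal table row; PostgreSQL TOASTs and
# compresses large values automatically.
import psycopg2

conn = psycopg2.connect("dbname=appdb")  # hypothetical database name
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id       serial PRIMARY KEY,
            filename text NOT NULL,
            data     bytea NOT NULL
        )
    """)
    with open("report.pdf", "rb") as f:          # e.g. a 20 MB PDF
        cur.execute(
            "INSERT INTO documents (filename, data) VALUES (%s, %s) RETURNING id",
            ("report.pdf", psycopg2.Binary(f.read())),
        )
    doc_id = cur.fetchone()[0]

# Reading it back: psycopg2 returns bytea as a memoryview.
with conn, conn.cursor() as cur:
    cur.execute("SELECT data FROM documents WHERE id = %s", (doc_id,))
    pdf_bytes = bytes(cur.fetchone()[0])
conn.close()

[The large-object route from the second link would instead go through the lo_* functions, or psycopg2's conn.lobject(), which stream the data rather than handling it as one value.]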
Thomas Güttler wrote:
> Now I realized: Nobody talked about Blobs.
>
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?
>
> (In my case Blobs are PDF/image files with size up to 20 MByte.
> I do not talk about very big blobs which are several hundred MByte)

Ideally outside the database, if they are many.
Large databases are harder to backup than large file systems.

If you keep 20MB binaries in the database, you'd use the "bytea" data type.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
On 3/13/19 9:28 AM, Thomas Güttler wrote:
> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
>
> Now I realized: Nobody talked about Blobs.
>
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?
>
> (In my case Blobs are PDF/image files with size up to 20 MByte.
> I do not talk about very big blobs which are several hundred MByte)

We store PDFs and TIFFs of dozens to a few hundred KB in the database as bytea.

--
Angular momentum makes the world go 'round.
I store them as bytea in the database despite the fact that there are benefits to storing them in the file system. The reason is that it is easier to secure access to the database than to secure both the database and provide secure access to the file system.
Chuck Martin
Avondale Software
On Wed, Mar 13, 2019 at 10:34 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
Thomas Güttler wrote:
> Now I realized: Nobody talked about Blobs.
>
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?
>
> (In my case Blobs are PDF/image files with size up to 20 MByte.
> I do not talk about very big blobs which are several hundred MByte)
Ideally outside the database, if they are many.
Large databases are harder to backup than large file systems.
If you keep 20MB binaries in the database, you'd use the "bytea" data type.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
On Wed, 13 Mar 2019 at 10:27, Thomas Güttler <guettliml@thomas-guettler.de> wrote:
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?

Things have changed, but at one time we were using RT as our ticketing system (https://bestpractical.com/request-tracker), and it would capture documents as database objects.

The table where RT stowed downloadable documents was one of the largest tables in the database, because there were a few 50MB copies of Access databases and some many-MB spreadsheets in there.

It worked fine; no problems evident from it.

It was certainly surprising to find such large documents there, and if people had gotten into the habit of putting GBs of data into RT, that would probably have led to some policy changes to prevent it. But unless you're pretty actively trying to blow the system up, it just works.

--
When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
On Wed, Mar 13, 2019 at 11:50:37AM -0400, Christopher Browne wrote:
>> I guess most people do not store Blobs in PostgreSQL.

- BYTEA puts practical limits on size
- LO storage happens inside the system (!) table

Nowadays, there are Foreign Data Wrappers which might encapsulate files as if they lived inside the database.

Also, a combination of COPY ... TO ... (FORMAT binary), pg_read_binary_file(), and suitable PL/pgSQL SECURITY DEFINER functions might provide for a Poor Man's integrated external storage of binary files.

Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B
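[A hedged sketch of the pg_read_binary_file() half of that idea, driven from Python/psycopg2. The wrapper name, table and directory are made-up examples; pg_read_binary_file() reads files that live on the database server and needs superuser or, in PostgreSQL 11+, the pg_read_server_files role, which is exactly why a SECURITY DEFINER wrapper owned by a privileged role is attractive here.]

# Sketch only: server-side file ingestion into bytea via pg_read_binary_file().
# Assumes the file exists on the *database server*, not the client, and that a
# "documents" table with (id serial, filename text, data bytea) already exists.
import psycopg2

conn = psycopg2.connect("dbname=appdb")
with conn, conn.cursor() as cur:
    # Hypothetical SECURITY DEFINER wrapper so ordinary roles may ingest
    # files from one whitelisted directory only.
    cur.execute("""
        CREATE OR REPLACE FUNCTION ingest_server_file(p_name text)
        RETURNS integer
        LANGUAGE sql SECURITY DEFINER AS $$
            INSERT INTO documents (filename, data)
            VALUES (p_name, pg_read_binary_file('/srv/incoming/' || p_name))
            RETURNING id;
        $$
    """)
    cur.execute("SELECT ingest_server_file(%s)", ("report.pdf",))
    doc_id = cur.fetchone()[0]
conn.close()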
On 13/3/19 4:28 p.m., Thomas Güttler wrote:
> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
>
> Where do you store Blobs?
>
> (In my case Blobs are PDF/image files with size up to 20 MByte.
> I do not talk about very big blobs which are several hundred MByte)

bytea. Keeping those in the file system and trying to keep the file system in sync with the db is a PITA. Also, I don't know what happens in the transactional dept (ACID) with lo_* large objects. In most cases bytea is just fine.

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
Hi,

On 13.03.19 at 15:28, Thomas Güttler wrote:
>
> Where do you store Blobs?

Within PostgreSQL, of course. The system I have in mind stores ZIP and PDF files, usually a few MBs each; we're currently at a total of about 100 GB and there are no evident problems.

For this application, it is extremely important that saving/modifying the binaries is part of a larger transaction that completes/fails atomically. Replication is almost real-time, and access to the binaries is provided with the same libraries (JDBC/Hibernate) as everything else.

Best regards,
-hannes
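[As a small illustration of that atomicity point, in Python/psycopg2 rather than JDBC/Hibernate; the orders/attachments tables are invented for the example. The metadata row and the binary either both commit or neither does.]

# Sketch: the file bytes and the metadata row share one transaction,
# so a failure anywhere rolls back both.
import psycopg2

conn = psycopg2.connect("dbname=appdb")
try:
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (customer, total) VALUES (%s, %s) RETURNING id",
            ("ACME", 199.00),
        )
        order_id = cur.fetchone()[0]
        with open("invoice.pdf", "rb") as f:
            cur.execute(
                "INSERT INTO attachments (order_id, filename, data) VALUES (%s, %s, %s)",
                (order_id, "invoice.pdf", psycopg2.Binary(f.read())),
            )
    conn.commit()          # both rows become visible together
except Exception:
    conn.rollback()        # neither the order nor the PDF is stored
    raise
finally:
    conn.close()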
Make dump/restore of database data unnecessarily expensive in terms of time and space imho.
I store large models in the database because I need historical data to compare to. That said, I could probably also automate a git repo, but it would be that much more work, and git with binary files really doesn't make sense. Storage is really cheap, and I assume the database stores bytea values, like it does text, within a separate and partitioned section of the database.
Thanks,
~Ben
On Wed, Mar 13, 2019 at 1:37 PM Jamesie Pic <jpic@yourlabs.org> wrote:
Make dump/restore of database data unnecessarily expensive in terms of time and space imho.
I don't think there is a suitable 'one size fits all' answer to this question. A lot will depend on how you intend to use the blobs and what sort of hardware architecture, especially storage systems, you have.

At first glance, sticking everything in the DB seems like an easy choice. However, that can result in very large databases, which in turn can lead to issues with respect to backup, replication etc. If all you're after is storage, then sometimes you're better off using the file system for the blobs and keeping the metadata in the db. It can potentially be faster and easier to serve up blobs from the file system compared to the db if that is the main use case; but if the blobs are more dynamic, or you use collections of blobs to build a master blob etc., the db has some advantages.

If you really need database-like functionality, then given the relative cheapness of storage and the wealth of options available, storing the blobs in the database can have advantages. However, it will be important to select the most appropriate data type. What some people think of as a 'blob' is just an array of bytes to many DBs, and as usual you need to decide what the best storage representation is for your requirements. Keep in mind that the more general 'blob'-like storage types often trade functionality for flexibility: they are less restrictive about what can be inserted, while more precise data types are more restrictive about what can be inserted but offer more functionality (at the db level) for working with it.

Tim

Thomas Güttler <guettliml@thomas-guettler.de> writes:
> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
>
> Now I realized: Nobody talked about Blobs.
>
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?
>
> (In my case Blobs are PDF/image files with size up to 20 MByte.
> I do not talk about very big blobs which are several hundred MByte)

--
Tim Cross
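[A rough sketch of the "metadata in the db, bytes on the file system" split Tim describes; the directory layout, hashing scheme and table are all assumptions.]

# Sketch: copy the file into a content-addressed tree and record only its
# metadata in Postgres. Paths and the "document_meta" table are invented.
import hashlib
import pathlib
import shutil

import psycopg2

BLOB_ROOT = pathlib.Path("/srv/blobs")          # hypothetical file tree

def store_document(conn, src_path: str) -> int:
    """Copy the file into the blob tree and record only its metadata in Postgres."""
    src = pathlib.Path(src_path)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    dest = BLOB_ROOT / digest[:2] / digest       # content-addressed path
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest)                   # NOTE: not covered by the db transaction

    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO document_meta (filename, sha256, path, size_bytes)
            VALUES (%s, %s, %s, %s) RETURNING id
            """,
            (src.name, digest, str(dest), src.stat().st_size),
        )
        return cur.fetchone()[0]

[Note that the file copy and the INSERT are not one atomic unit, which is precisely the sync headache other posters in this thread mention.]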
I am curious. Why did nobody say:

store blobs in a storage server (like S3) and only store the blob-id in PostgreSQL?

Regards,
Thomas

On 13.03.19 at 15:28, Thomas Güttler wrote:
> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
>
> Now I realized: Nobody talked about Blobs.
>
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?
>
> (In my case Blobs are PDF/image files with size up to 20 MByte.
> I do not talk about very big blobs which are several hundred MByte)

--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines
On Mon, Mar 18, 2019 at 11:42:00AM +0100, Thomas Güttler wrote:
> I am curious. Why did nobody say:
>
> store blobs in a storage server (like S3) and only store the blob-id in PostgreSQL?

That's been rehashed to the point of becoming a FAQ:

https://wiki.postgresql.org/wiki/BinaryFilesInDB

Karsten Hilbert
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B
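[For completeness, a hedged sketch of the pattern Thomas asks about: the object in S3, only the key and some metadata in PostgreSQL. The bucket, table and key scheme are invented for the example, using boto3 and psycopg2.]

# Sketch: push the bytes to S3, keep only the object key and metadata in Postgres.
# Assumed bucket "my-app-blobs" and table "blobs"; error handling omitted.
import uuid

import boto3
import psycopg2

s3 = boto3.client("s3")
conn = psycopg2.connect("dbname=appdb")

def store_blob(filename: str, data: bytes) -> str:
    key = f"blobs/{uuid.uuid4()}/{filename}"
    s3.put_object(Bucket="my-app-blobs", Key=key, Body=data)
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO blobs (filename, s3_key, size_bytes) VALUES (%s, %s, %s)",
            (filename, key, len(data)),
        )
    return key

def load_blob(key: str) -> bytes:
    obj = s3.get_object(Bucket="my-app-blobs", Key=key)
    return obj["Body"].read()

[As with the file-system variant, the upload and the INSERT are not atomic, which is one of the classic trade-offs in that comparison.]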
Storing files in the database has already been demonstrated to be an inefficient trade-off [0]. As such, this post offers a subjective story about files in the database rather than an objective comparison.
So, I have this ridiculous situation: a database with 12k user-submitted forms with files, made by 3000 users, which I hope we can agree “is absolutely nothing”, yet it eats 35G of PG data.
As a result, dumping the database already takes 32 minutes and is extremely write-intensive, causing disk backlogs of more than 5 seconds during the operation.
If I had stored file paths, as I should have, the db would take a few MBs and the dump a few seconds, maybe even less than one second.
Also, the backup could just read the file tree from the file system and synchronize only new files: we don’t have programmatic modifications of written files, for legal reasons, which makes it even more ridiculous to dump them every time we do a backup.
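[A hedged sketch of what that backup could look like once the blobs live on disk; the database name, table name and paths are invented, and the --exclude-table-data step only matters if a legacy bytea table still exists.]

# Sketch: dump only the relational data and rsync the immutable file tree.
import subprocess

# Custom-format dump, skipping the contents of the (hypothetical) bytea table.
subprocess.run(
    ["pg_dump", "--exclude-table-data=uploaded_files", "-Fc",
     "-f", "/backups/appdb.dump", "appdb"],
    check=True,
)

# The files themselves never change after upload, so only new ones are copied.
subprocess.run(
    ["rsync", "-a", "--ignore-existing", "/srv/uploads/", "/backups/uploads/"],
    check=True,
)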
As such, my biggest regret is having listened to the manager who imposed this design decision; I trusted him because I was a fresh hire at the time.
Today, we have the choice of two poisons for deployment:
- not shutting down the service during the backup, saving 32 minutes of downtime, but leaving 32 minutes of writes at risk in the case of a faulty destructive operation (yes, that doesn’t happen 99% of the time),
- shutting down the service during the backup, as we should, but that means 32 minutes of extra downtime.
In my experience, storing files in the database is a critical mistake. Maybe it can work if you know what you’re doing, and I thought said manager knew what he was doing.
But after a year the manager decided to ignore all monitoring alarms concerning disk space that were raised during the backup. So I kept going, freeing as much space as possible whenever a backup failed to happen, which fortunately canceled the full deployment and left the service online, albeit with a full disk.
I raised the warning to the customer for months, but the manager kept insisting that we close our eyes to it, and we kept applying palliative fixes when needed.
Of course, piling up palliative fixes in production eventually created the situation where the disk was too full to make a backup. The manager who had installed the server OS had put 3 disks in a RAID1 array with extremely poor partitioning.
As such, I had to spend a night repartitioning the RAID1 so that / would be on 10G instead of 40G. That kept us going a bit longer, but unfortunately one week less than expected, because I had forgotten to include the exponential growth in my estimate.
This led to even more incidents. If you store files in the database, which you shouldn’t unless you’re 100% sure about what you’re doing, then do not ignore disk space warnings during backups. Or else … well, what do you think happens when an airplane pilot ignores the alarms on their dashboard?
99% of incidents are a sequence of predictable events.
I forgot to mention that my deployments include automated migrations as often as possible, sometimes destructive ones for refactoring purposes. As such, to maintain PostgreSQL on a basic Linux box I am:
- for having an automated backup step in the automated deployment script, prior to playing any destructive migrations,
- against the needless overhead of coupling binary and relational data in operations, which slows the whole thing down and makes it less reliable.
I also have quite a few new points against it, mixed with more detail on the points briefly made in my previous email, going deeper into how this fits into the big picture of my personal practice ... and how it has destabilized my prod for months:
tl;dr
If you store media files in PostgreSQL on a production server, then do take disk space alarms seriously even if they happen only during backups.
Otherwise I fail to see how to avoid a pattern of recurring incidents: "manually unblocking automated deployments" (a double interest payment on the debt, because it also defeats the purpose of automating deployments), when not outright filling up a disk during the nightly backup dump ...
Hope this helps,
Have a great day
Good question, and there are some excellent thoughts and cautionary tales in the thread already. I've faced this question several times down the years and, as others have said, the best answer depends. Roughly speaking, I can think of three obvious places to store documents:
* A database.
* A file system.
* A commodity file-system-like cloud service, such as S3, B2 Cloud, Azure, or Google Cloud.
There are pros and cons to each, and circumstances that will make one option or another untenable. It also depends a lot on your overall architecture, if documents can change, how your searches work, your security/regulatory requirements, the overall workflow, etc.
But given all of that, I hate bloating a database with BLOBs that aren't themselves searchable. That's what file systems are for. This past week, I've been coding against the B2 Cloud API and it's quite nice...simpler than S3, and much cheaper. In this setup, we keep a document catalog in the database that ties together a source record and an external document. The "cloud_doc" record contains identifiers and information from both parts of this story. So, descriptors, tags and IDs needed when you're looking for something. In our case, that's large XML log files. These logs are either of zero value or great value. We need to keep them, but 99.99999% of the time they're pure bloat. We've got a record about the event the log details. The cloud_doc record includes a reference back to the original, and to the storage location of the full log. So, the service type, bucket ID, document ID, etc. So the catalog record links our real data with the remote document, same as you would storing file paths to a local file tree.
Storing the data externally has some risks and limitations that make it unsuitable in some situations. For example, you can't include the file or cloud system in a transaction in any normal sense. Also, depending on platform, there may be an unacceptable lag between pushing a file up and it becoming visible to other systems. But for common cases, external (file or cloud) storage can be pretty great.
Likewise, it's kind of scary to have the files someplace else where Bad Things Might Happen and your database and file tree are out of sync. That's genuinely worrisome. It's likely overkill, but I like to store references back into the database in the meta-data for the document up on the cloud. On S3 and B2, you store meta-data as "tags". I haven't used the document storage systems on Azure or Google, but they likely have similar functionality. A "tag" up on the cloud repository is a name-value-pair that you can add custom data to. So, for example, I'll store the ID of our cloud document catalog record in the meta-data (tags) for the document on the cloud service. I also stash the ID of the source record that the log relates to. If all goes well, we'll never need these tags but if worst came to worst, they provide a way of calculating the parents of each external document.
That last paragraph about tags points out something important: Cloud document storage can be preferable to a standard file system for reasons other than simplified provisioning and cost. Many file systems don't let you readily add and search meta-data like this, whereas it's a quite accessible feature of cloud storage systems.
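[A hedged sketch of that back-reference idea using S3 user metadata via boto3; B2 has an equivalent feature. The bucket, keys and field names are invented, and note that S3 also has separate "object tags", though user metadata is the simpler fit here.]

# Sketch: stash back-references to the database rows in the object's metadata,
# so an orphaned object can be traced back to its catalog record if the two
# ever get out of sync. Bucket, key and ids are illustrative only.
import boto3

s3 = boto3.client("s3")

def upload_log(xml_bytes: bytes, key: str, cloud_doc_id: int, source_id: int) -> None:
    s3.put_object(
        Bucket="my-log-archive",
        Key=key,
        Body=xml_bytes,
        Metadata={                      # stored as x-amz-meta-* headers
            "cloud-doc-id": str(cloud_doc_id),
            "source-record-id": str(source_id),
        },
    )

def who_owns(key: str) -> dict:
    # Read the back-references without downloading the (possibly huge) log.
    head = s3.head_object(Bucket="my-log-archive", Key=key)
    return head["Metadata"]             # e.g. {'cloud-doc-id': '42', ...}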
Going in another direction entirely, there may be times when what you need is a local file storage system for documents. File systems are...scary, but what about SQLite? You have a single file with a modern SQL syntax where you can stuff BLOBs. You don't bloat your Postgres data file, but still have a database for the documents instead of a file system. SQLite only looks to solve problems in a pretty specific set of cases but, as it turns out, those cases are also quite common. Combining it with a Postgres setup seems pretty exotic, but there might be a time. It depends on your setup.
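[And a tiny sketch of the SQLite-on-the-side idea, since Python ships sqlite3 in the standard library; the file path and table are invented.]

# Sketch: keep documents in a local SQLite file instead of bloating Postgres,
# while Postgres keeps only a reference (e.g. the returned row id).
import sqlite3

docs = sqlite3.connect("/var/lib/myapp/documents.sqlite")
with docs:
    docs.execute(
        "CREATE TABLE IF NOT EXISTS blobs (id INTEGER PRIMARY KEY, name TEXT, data BLOB)"
    )

def put(name: str, data: bytes) -> int:
    with docs:  # connection as context manager commits or rolls back
        cur = docs.execute("INSERT INTO blobs (name, data) VALUES (?, ?)", (name, data))
        return cur.lastrowid

def get(doc_id: int) -> bytes:
    row = docs.execute("SELECT data FROM blobs WHERE id = ?", (doc_id,)).fetchone()
    return row[0]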
On Sat, Apr 20, 2019 at 10:59 AM Jamesie Pic <jpic@yourlabs.org> wrote:
> I forgot to mention that my deployments include automated migrations as often as possible, sometimes destructive ones for refactoring purposes. As such, to maintain PostgreSQL on a basic Linux box I am:
> - for having an automated backup step in the automated deployment script, prior to playing any destructive migrations,
> - against the needless overhead of coupling binary and relational data in operations, which slows the whole thing down and makes it less reliable.
>
> I also have quite a few new points against it, mixed with more detail on the points briefly made in my previous email, going deeper into how this fits into the big picture of my personal practice ... and how it has destabilized my prod for months:
>
> tl;dr
>
> If you store media files in PostgreSQL on a production server, then do take disk space alarms seriously even if they happen only during backups.
>
> Otherwise I fail to see how to avoid a pattern of recurring incidents: "manually unblocking automated deployments" (a double interest payment on the debt, because it also defeats the purpose of automating deployments), when not outright filling up a disk during the nightly backup dump ...
>
> Hope this helps,
> Have a great day
Thanks for your feedback. In my case, Bob the manager said this would "make backups easier" xD
The general pro that I see is "transactional". While I can understand that, it's irrelevant in our case: users upload files through AJAX, which happens **before** they submit the form. That means the file gets saved in an HTTP exchange that happens **before** the actual data-insertion transaction.
As such, be careful that this argument is not irrelevant in your case, like it is in mine.