Thread: Where to store Blobs?

Where to store Blobs?

From
Thomas Güttler
Date:
Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.

Now I realized: Nobody talked about Blobs.

I guess most people do not store Blobs in PostgreSQL.

Where do you store Blobs?

(In my case, Blobs are PDF/image files up to 20 MByte in size.
I am not talking about very big blobs of several hundred MByte.)



-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines


Re: Where to store Blobs?

From
Adrian Klaver
Date:
On 3/13/19 7:28 AM, Thomas Güttler wrote:
> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing 
> list.
> 
> Now I realized: Nobody talked about Blobs.
> 
> I guess most people do not store Blobs in PostgreSQL.
> 
> Where do you store Blobs?

Probably the preferred method:
https://www.postgresql.org/docs/11/datatype-binary.html

Another method:
https://www.postgresql.org/docs/11/largeobjects.html
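
For concreteness, a minimal sketch of both options (table and column
names are invented, not from the docs):

    -- bytea route: the bytes live in the table itself; values larger
    -- than ~2 kB are compressed and moved to TOAST storage automatically
    CREATE TABLE doc (
        id       bigserial PRIMARY KEY,
        filename text NOT NULL,
        content  bytea NOT NULL
    );

    -- large-object route: the table keeps only an OID; the bytes live
    -- in the pg_largeobject system catalog and can be streamed
    CREATE TABLE doc_lo (
        id       bigserial PRIMARY KEY,
        filename text NOT NULL,
        content  oid NOT NULL   -- e.g. from lo_import('/path/on/server')
    );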

> 
> (In my case, Blobs are PDF/image files up to 20 MByte in size.
> I am not talking about very big blobs of several hundred MByte.)
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Where to store Blobs?

From
Laurenz Albe
Date:
Thomas Güttler wrote:
> Now I realized: Nobody talked about Blobs.
> 
> I guess most people do not store Blobs in PostgreSQL.
> 
> Where do you store Blobs?
> 
> (In my case, Blobs are PDF/image files up to 20 MByte in size.
> I am not talking about very big blobs of several hundred MByte.)

Ideally outside the database, if they are many.
Large databases are harder to back up than large file systems.

If you keep 20MB binaries in the database, you'd use the "bytea" data type.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com



Re: Where to store Blobs?

From
Ron
Date:
On 3/13/19 9:28 AM, Thomas Güttler wrote:
> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing 
> list.
>
> Now I realized: Nobody talked about Blobs.
>
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?
>
> (In my case, Blobs are PDF/image files up to 20 MByte in size.
> I am not talking about very big blobs of several hundred MByte.)

We store PDFs and TIFFs of a few dozen to a few hundred KB in the database as bytea.

-- 
Angular momentum makes the world go 'round.


Re: Where to store Blobs?

From
Chuck Martin
Date:
I store them as bytea in the database, despite the fact that there are benefits to storing them in the file system. The reason is that it is easier to secure access to the database alone than to secure both the database and access to the file system.
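
For example, with everything in the database, access control stays in plain SQL; a minimal sketch, assuming an "app_user" role and invented table names:

    -- no file-system permissions to keep in sync with the database
    CREATE TABLE attachment (
        id      bigserial PRIMARY KEY,
        owner   text NOT NULL DEFAULT current_user,
        content bytea NOT NULL
    );

    -- let users see only their own rows
    ALTER TABLE attachment ENABLE ROW LEVEL SECURITY;
    CREATE POLICY own_rows ON attachment
        USING (owner = current_user);

    GRANT SELECT, INSERT ON attachment TO app_user;
    GRANT USAGE ON SEQUENCE attachment_id_seq TO app_user;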

Chuck Martin
Avondale Software


On Wed, Mar 13, 2019 at 10:34 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
Thomas Güttler wrote:
> Now I realized: Nobody talked about Blobs.
>
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?
>
> (In my case, Blobs are PDF/image files up to 20 MByte in size.
> I am not talking about very big blobs of several hundred MByte.)

Ideally outside the database, if they are many.
Large databases are harder to back up than large file systems.

If you keep 20MB binaries in the database, you'd use the "bytea" data type.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com


Re: Where to store Blobs?

From
Christopher Browne
Date:
On Wed, 13 Mar 2019 at 10:27, Thomas Güttler
<guettliml@thomas-guettler.de> wrote:
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?

Things have changed, but at one time, we were using RT as our
ticketing system (https://bestpractical.com/request-tracker) and it
would capture documents as database objects.

The table where RT stowed downloadable documents was one of the
largest in the database, because it held a few 50 MB copies of Access
databases and a number of multi-megabyte spreadsheets.

It worked fine; no problems were evident from it.  It was certainly
surprising to find such large documents there, and if people had
gotten into the habit of putting GBs of data into RT, that would
probably have led to some policy changes to prevent it; but unless
you're pretty actively trying to blow the system up, it just works.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Where to store Blobs?

From
Karsten Hilbert
Date:
On Wed, Mar 13, 2019 at 11:50:37AM -0400, Christopher Browne wrote:

>> I guess most people do not store Blobs in PostgreSQL.

- BYTEA puts practical limits on size (the hard limit is 1 GB,
  practical limits are much lower)

- LO storage happens inside the system (!) table pg_largeobject

Nowadays, there are Foreign Data Wrappers which might
encapsulate files as if they lived inside the database.

Also, a combination of

    COPY TO FORMAT binary
    pg_read_binary_file()

and suitable plpgsql security definer functions might provide
a Poor Man's integrated external storage for binary files.
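
Something like this, perhaps (a sketch only, assuming a blobs/
directory under the data directory; a real version would also have to
reject '..' in the path):

    -- pg_read_binary_file() is restricted to superusers, so expose it
    -- through a SECURITY DEFINER function owned by one
    CREATE FUNCTION read_blob(relpath text) RETURNS bytea
        LANGUAGE sql SECURITY DEFINER
        AS $$ SELECT pg_read_binary_file('blobs/' || relpath) $$;

    REVOKE ALL ON FUNCTION read_blob(text) FROM PUBLIC;
    GRANT EXECUTE ON FUNCTION read_blob(text) TO app_user;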

Karsten
-- 
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B


Re: Where to store Blobs?

From
Achilleas Mantzios
Date:
On 13/3/19 4:28 p.m., Thomas Güttler wrote:
> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
>
> Where do you store Blobs?
>
> (In my case, Blobs are PDF/image files up to 20 MByte in size.
> I am not talking about very big blobs of several hundred MByte.)
>
bytea. Keeping those in the filesystem and trying to keep the
filesystem in sync with the db is a PITA. Also, I don't know what
happens in the transactional dept (ACID) with lo_* large objects. In
most cases bytea is just fine.
>
>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt



Re: Where to store Blobs?

From
Hannes Erven
Date:
Hi,


On 13.03.19 at 15:28, Thomas Güttler wrote:
> 
> Where do you store Blobs?


Within PostgreSQL, of course. The system I have in mind stores ZIP and 
PDF files, usually a few MBs each; we're currently at a total of about 
100 GB and there are no evident problems.

For this application, it is extremely important that saving/modifying 
the binaries is part of a larger transaction that completes/fails 
atomically.
Replication is almost real-time and access to the binaries is provided 
with the same libraries (JDBC/Hibernate) as everything else.
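
That atomicity needs no extra machinery; a sketch (names invented, the 
bytea bound as a driver parameter):

    -- the binary and its metadata commit or roll back together
    BEGIN;
    INSERT INTO document (order_id, filename, content)
        VALUES (42, 'contract.pdf', $1);   -- $1: the bytea parameter
    UPDATE orders SET has_contract = true WHERE id = 42;
    COMMIT;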


Best regards,

    -hannes



Re: Where to store Blobs?

From
Jamesie Pic
Date:
Storing blobs in the database makes dump/restore of the data unnecessarily expensive in terms of time and space, IMHO.

Re: Where to store Blobs?

From
Benedict Holland
Date:
I store large models in the database because I need historical data to compare against. That said, I could probably also automate a git repo, but that would be just that much more work, and git with binary files really doesn't make sense. Storage is really cheap, and I assume the database stores bytea types like it does text, in a separate, partitioned section of the database.
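
That assumption is essentially right: large bytea values, like long text, are compressed and moved out of line into the table's TOAST relation once they exceed roughly 2 kB. A quick way to see how much lives there ('models' is a made-up table name):

    -- main heap vs. TOAST (plus free-space/visibility maps)
    SELECT pg_size_pretty(pg_relation_size('models'))   AS main_heap,
           pg_size_pretty(pg_table_size('models')
                          - pg_relation_size('models')) AS toast_and_maps;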

Thanks,
~Ben 

On Wed, Mar 13, 2019 at 1:37 PM Jamesie Pic <jpic@yourlabs.org> wrote:
Storing blobs in the database makes dump/restore of the data unnecessarily expensive in terms of time and space, IMHO.

Re: Where to store Blobs?

From
Tim Cross
Date:
I don't think there is a suitable 'one size fits all' answer to this
question. A lot will depend on how you intend to use the blobs and what
sort of hardware architecture, especially storage systems, you have.

At first glance, sticking everything in the DB seems like an easy
choice. However, that can result in very large databases, which in turn
can lead to issues with respect to backup, replication etc. If all
you're after is storage, then sometimes you're better off using the
file system for the blobs and keeping the metadata in the db. It can
potentially be faster and easier to serve up blobs from the file system
than from the db if that is the main use case, but if the blobs are
more dynamic, or you use collections of blobs to build a master blob
etc., the db has some advantages.
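
For illustration, the usual shape of that split (a sketch only; table
and column names are invented):

    -- metadata stays searchable in the database; the bytes live
    -- on the file system under the stored path
    CREATE TABLE document (
        id        bigserial PRIMARY KEY,
        path      text NOT NULL UNIQUE,   -- location in the file tree
        mime_type text NOT NULL,
        sha256    bytea NOT NULL,         -- to detect db/file-tree drift
        added_at  timestamptz NOT NULL DEFAULT now()
    );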

If you really need database-like functionality, then given the
relative cheapness of storage and the wealth of options available,
storing the blobs in the database can have advantages. However, it will
be important to select the most appropriate datatype. What some people
think of as a 'blob' is, to many DBs, just an array of bytes, and as
usual you need to decide on the best storage representation for your
requirements. Keep in mind that the more general 'blob'-like storage
types trade functionality for flexibility: they accept almost anything,
whereas more precise data types are more restrictive about what can be
inserted but offer more functionality regarding what you can do with it
(at the db level).

Tim


Thomas Güttler <guettliml@thomas-guettler.de> writes:

> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
>
> Now I realized: Nobody talked about Blobs.
>
> I guess most people do not store Blobs in PostgreSQL.
>
> Where do you store Blobs?
>
> (In my case, Blobs are PDF/image files up to 20 MByte in size.
> I am not talking about very big blobs of several hundred MByte.)


--
Tim Cross


s3 was: Where to store Blobs?

From
Thomas Güttler
Date:
I am curious. Why did nobody say:

  store blobs in a storage server (like s3) and only store the blob-id in PostgreSQL?

Regards,
   Thomas



On 13.03.19 at 15:28, Thomas Güttler wrote:
> Some days ago I asked "Where **not** to use PostgreSQL?" on this mailing list.
> 
> Now I realized: Nobody talked about Blobs.
> 
> I guess most people do not store Blobs in PostgreSQL.
> 
> Where do you store Blobs?
> 
> (In my case, Blobs are PDF/image files up to 20 MByte in size.
> I am not talking about very big blobs of several hundred MByte.)
> 
> 
> 

-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines


Re: s3 was: Where to store Blobs?

From
Karsten Hilbert
Date:
On Mon, Mar 18, 2019 at 11:42:00AM +0100, Thomas Güttler wrote:

> I am curious. Why did nobody say:
> 
>  store blobs in a storage server (like s3) and only store the blob-id in PostgreSQL?

That's been rehashed to the point of becoming a FAQ:

    https://wiki.postgresql.org/wiki/BinaryFilesInDB

Karsten Hilbert
-- 
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B


Re: Where to store Blobs?

From
Jamesie Pic
Date:
Storing files in the database has already been demonstrated to be an inefficient trade-off [0]. As such, this post provides a subjective story about files in the database, rather than an objective comparison.

So, I have this ridiculous situation: a database of 12k user-submitted forms with files, made by 3000 users, which I hope we can agree is absolutely nothing, and yet it eats 35 GB of PG data.

As a result, dumping the database already takes 32 minutes and is extremely write-intensive, causing disk backlogs of more than 5 seconds during the operation.

If I had stored file paths, as I should have, the db would take a few MBs and the dump a few seconds, maybe even less than one second.

Also, the backup could then just read the file tree from the file system and synchronize only new files: for legal reasons we have no programmatic modification of written files, which makes it even more ridiculous to dump them with every backup.

As such, my biggest regret is having listened to the manager who imposed this design decision; but I trusted him, being a fresh hire at the time.

Today, we have the choice of two poisons for deployment:

- not shutting down the service during the backup, saving 32 minutes of downtime, but leaving 32 minutes of writes at risk in the case of a faulty destructive operation (which, yes, doesn't happen 99% of the time), or
- shutting down the service during the backup, as we should, which means 32 minutes of extra downtime.

In my experience, storing files in the database is a critical mistake. I mean, maybe if you know what you're doing; and I thought said manager knew what he was doing.

But at the end of a year, the manager decided to ignore all the disk-space monitoring alarms raised during backups. So I kept going, freeing as much space as possible whenever a backup failed to happen; luckily, such a failure cancelled the full deployment, leaving the service online, even though with a full disk.

I raised the warning with the customer for months, but the manager kept insisting that we close our eyes to it, and I kept applying palliative fixes when needed.

Of course, piling up palliative fixes in production eventually created the situation where the disk was too full to make a backup. The manager who had installed the server OS had put 3 disks in a RAID1 array with extremely poor partitioning.

As such, I had to spend a night repartitioning the RAID1 so that / would sit on 10 GB instead of 40 GB. That kept us going a bit longer, but unfortunately one week less than expected, because I had forgotten to include the exponential growth in the math of the estimation.

This led to even more incidents. If you store files in the database, which you shouldn't unless you're 100% sure about what you're doing, then do not ignore disk space warnings during backups. Or else ... well, what do you think happens when an airplane pilot ignores the alarms on their dashboard?

99% of incidents are a sequence of predictable events.

Re: Where to store Blobs?

From
Jamesie Pic
Date:
I forgot to mention that my deployments include automated migrations as often as possible, sometimes destructive ones for refactoring purposes. As such, to maintain PostgreSQL on a basic Linux box, I am:

- for having an automated backup in the automated deployment script, prior to any destructive migrations it may play,
- against the needless overhead of coupling binary and relational data in operations, which slows the whole thing down or makes it less reliable.

I also have supposedly many new points against, mixed with more detail on the points briefly made in my previous email, going deeper into how this fits in the big picture of my personal practice ... and how it has destabilized my prod for months:

tl;dr
If you store media files in PostgreSQL on a production server, then do take disk space alarms seriously even if they happen only during backups.
Otherwise I fail to see how to avoid a pattern of recurring incidents: "manually unblocking automated deployments" (a double interest payment on the debt, because it also defeats the purpose of automating deployments), when not filling up a disk during the nightly backup dump ...

Hope this helps,

Have a great day

Re: Where to store Blobs?

From
Morris de Oryx
Date:
Good question, and there are some excellent thoughts and cautionary tales in the thread already. I've faced this question several times down the years and, as others have said, the best answer depends. Roughly speaking, I can think of three obvious places to store documents:

* A database.

* A file system.

* A commodity file-system-like cloud service, such as S3, B2 Cloud, Azure, or Google Cloud.

There are pros and cons to each, and circumstances that will make one option or another untenable. It also depends a lot on your overall architecture, if documents can change, how your searches work, your security/regulatory requirements, the overall workflow, etc.

But given all of that, I hate bloating a database with BLOBs that aren't themselves searchable. That's what file systems are for. This past week, I've been coding against the B2 Cloud API and it's quite nice...simpler than S3, and much cheaper.

In this setup, we keep a document catalog in the database that ties together a source record and an external document. The "cloud_doc" record contains identifiers and information from both parts of this story: the descriptors, tags and IDs needed when you're looking for something. In our case, that's large XML log files. These logs are either of zero value or great value. We need to keep them, but 99.99999% of the time they're pure bloat. We've got a record about the event the log details, and the cloud_doc record includes a reference back to that original, plus the storage location of the full log: the service type, bucket ID, document ID, etc. The catalog record thus links our real data with the remote document, just as if you were storing file paths to a local file tree.
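
A sketch of what such a catalog might look like (the column names are guesses for illustration, not Morris's actual schema):

    -- ties a local source record to its external (cloud) document
    CREATE TABLE cloud_doc (
        id          bigserial PRIMARY KEY,
        source_id   bigint NOT NULL,   -- the event record the log details
        service     text NOT NULL,     -- 's3', 'b2', ...
        bucket_id   text NOT NULL,
        document_id text NOT NULL,     -- object key on the service
        tags        jsonb              -- descriptors used when searching
    );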

Storing the data externally has some risks and limitations that make it unsuitable in some situations. For example, you can't include the file or cloud system in a transaction in any normal sense. Also, depending on platform, there may be an unacceptable lag between pushing a file up and it becoming visible to other systems. But for common cases, external (file or cloud) storage can be pretty great.

Likewise, it's kind of scary to have the files someplace else where Bad Things Might Happen and your database and file tree get out of sync. That's genuinely worrisome. It's likely overkill, but I like to store references back to the database in the meta-data for the document up on the cloud. On S3 and B2, you store meta-data as "tags". I haven't used the document storage systems on Azure or Google, but they likely have similar functionality. A "tag" on the cloud repository is a name-value pair to which you can add custom data. So, for example, I'll store the ID of our cloud document catalog record in the meta-data (tags) for the document on the cloud service. I also stash the ID of the source record that the log relates to. If all goes well, we'll never need these tags, but if worst comes to worst, they provide a way of reconstructing the parents of each external document.

That last paragraph about tags points out something important: cloud document storage can be preferable to a standard file system for reasons other than simplified provisioning and cost. Many file systems don't let you readily add and search meta-data like this, whereas it's a quite accessible feature of cloud storage systems.

Going in another direction entirely, there may be times when what you need is a local file storage system for documents. File systems are...scary, but what about SQLite? You get a single file with modern SQL syntax where you can stuff BLOBs. You don't bloat your Postgres data file, but you still have a database for the documents instead of a file system. SQLite only looks to solve problems in a pretty specific set of cases but, as it turns out, those cases are also quite common. Combining it with a Postgres setup seems pretty exotic, but there might be a time for it. It depends on your setup.

On Sat, Apr 20, 2019 at 10:59 AM Jamesie Pic <jpic@yourlabs.org> wrote:
I forgot to mention that my deployments include automated migrations as often as possible, sometimes destructive ones for refactoring purposes. As such, to maintain PostgreSQL on a basic Linux box, I am:

- for having an automated backup in the automated deployment script, prior to any destructive migrations it may play,
- against the needless overhead of coupling binary and relational data in operations, which slows the whole thing down or makes it less reliable.

I also have supposedly many new points against, mixed with more detail on the points briefly made in my previous email, going deeper into how this fits in the big picture of my personal practice ... and how it has destabilized my prod for months:

tl;dr
If you store media files in PostgreSQL on a production server, then do take disk space alarms seriously even if they happen only during backups.
Otherwise I fail to see how to avoid a pattern of recurring incidents: "manually unblocking automated deployments" (a double interest payment on the debt, because it also defeats the purpose of automating deployments), when not filling up a disk during the nightly backup dump ...

Hope this helps,

Have a great day

Re: Where to store Blobs?

From
Jamesie Pic
Date:
Thanks for your feedback. In my case, Bob the manager said this would "make backups easier" xD

The general pro that I see is "transactional". While I can understand that, it's irrelevant in our case: users upload files through AJAX, and that happens **before** they submit the form. That means the file gets saved in an HTTP exchange that takes place **before** the actual data-insertion transaction.

As such, be careful that this argument is not just as irrelevant in your case as it is in mine.