Re: Where to store Blobs? - Mailing list pgsql-general

From Morris de Oryx
Subject Re: Where to store Blobs?
Date
Msg-id CAKqncchTyF3_M+rnBw=5Cn0cFLa6VQy6htt=VD7XUUpdoTiVMQ@mail.gmail.com
In response to Re: Where to store Blobs?  (Jamesie Pic <jpic@yourlabs.org>)
Responses Re: Where to store Blobs?  (Jamesie Pic <jpic@yourlabs.org>)
List pgsql-general
Good question, and there are some excellent thoughts and cautionary tales in the thread already. I've faced this question several times down the years and, as others have said, the best answer depends. Roughly speaking, I can think of three obvious places to store documents:

* A database.

* A file system.

* A commodity file-system-like cloud service, such as S3, B2 Cloud, Azure, or Google Cloud.

There are pros and cons to each, and circumstances that will make one option or another untenable. It also depends a lot on your overall architecture, whether documents can change, how your searches work, your security/regulatory requirements, the overall workflow, etc.

But given all of that, I hate bloating a database with BLOBs that aren't themselves searchable. That's what file systems are for. This past week, I've been coding against the B2 Cloud API and it's quite nice...simpler than S3, and much cheaper. In this setup, we keep a document catalog in the database that ties together a source record and an external document. The "cloud_doc" record contains identifiers and information from both sides of that relationship: the descriptors, tags, and IDs you need when you're looking for something. In our case, the documents are large XML log files. These logs are either of zero value or great value; we need to keep them, but 99.99999% of the time they're pure bloat. We've already got a record describing the event the log details. The cloud_doc record includes a reference back to that original record, and to the storage location of the full log: the service type, bucket ID, document ID, and so on. So the catalog record links our real data with the remote document, the same as you would when storing file paths to a local file tree.
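
To make that concrete, here is a rough sketch of what such a catalog table might look like, created from Python with psycopg2. The table and column names are illustrative only, not our actual schema:

import psycopg2

ddl = """
CREATE TABLE IF NOT EXISTS cloud_doc (
    id            bigserial   PRIMARY KEY,
    source_table  text        NOT NULL,  -- which local table the source record lives in
    source_id     bigint      NOT NULL,  -- PK of the record the document relates to
    service       text        NOT NULL,  -- 's3', 'b2', and so on
    bucket_id     text        NOT NULL,
    document_id   text        NOT NULL,  -- key / file ID on the cloud service
    content_type  text,
    byte_size     bigint,
    tags          jsonb,                 -- descriptors used when searching
    uploaded_at   timestamptz NOT NULL DEFAULT now()
);
"""

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:   -- the connection context manager commits on success
    cur.execute(ddl)
conn.close()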

Storing the data externally has some risks and limitations that make it unsuitable in some situations. For example, you can't include the file or cloud system in a transaction in any normal sense. Also, depending on platform, there may be an unacceptable lag between pushing a file up and it becoming visible to other systems. But for common cases, external (file or cloud) storage can be pretty great.
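
One common way to soften the transaction problem is to push the file up first and only commit the catalog row afterwards, so a failure leaves at worst an orphaned object in the bucket rather than a catalog row pointing at nothing. A sketch only, assuming boto3 and psycopg2, with made-up bucket and table names:

import boto3
import psycopg2

s3 = boto3.client("s3")

def archive_log(conn, source_id, key, body):
    # 1. Push the file up first.
    s3.put_object(Bucket="event-logs", Key=key, Body=body)
    # 2. Then record it in the catalog inside a normal database transaction.
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO cloud_doc (source_table, source_id, service, bucket_id, document_id)
            VALUES ('event', %s, 's3', 'event-logs', %s)
            """,
            (source_id, key),
        )
    # If the INSERT fails, the uploaded object is orphaned but harmless; a periodic
    # sweep can delete objects with no matching catalog row.

The reverse order is the dangerous one: a committed catalog row pointing at a document that never made it up.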

Likewise, it's kind of scary to have the files someplace else where Bad Things Might Happen and your database and file tree end up out of sync. That's genuinely worrisome. It's likely overkill, but I like to store references back into the database in the meta-data for the document up on the cloud. On S3 and B2, you store this meta-data as "tags". I haven't used the document storage systems on Azure or Google, but they likely have similar functionality. A "tag" up on the cloud repository is a name-value pair that you can attach custom data to. So, for example, I'll store the ID of our cloud document catalog record in the meta-data (tags) for the document on the cloud service. I also stash the ID of the source record that the log relates to. If all goes well, we'll never need these tags, but if worst came to worst, they provide a way of working out the parent records of each external document.
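
For example, with boto3 against S3 the back-references can ride along on the upload itself. Again just a sketch: the bucket, key, and IDs are invented, and S3 distinguishes user-defined metadata from object tags, either of which can carry this information:

import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="event-logs",
    Key="2019/04/20/event-12345.xml",
    Body=b"<log>...</log>",
    Metadata={                    # user-defined metadata, returned on GET/HEAD
        "cloud-doc-id": "98765",  # PK of our cloud_doc catalog record
        "source-id": "12345",     # PK of the source record the log relates to
    },
    # Tagging="cloud-doc-id=98765&source-id=12345",  # object tags are an alternative
)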

That last paragraph about tags points out something important: cloud document storage can be preferable to a standard file system for reasons other than simplified provisioning and cost. Many file systems don't let you readily add and search meta-data like this, whereas it's quite an accessible feature of cloud storage systems.

Going in another direction entirely, there may be times when what you need is a local file storage system for documents. File systems are...scary, but what about SQLite? You get a single file with modern SQL syntax where you can stuff BLOBs. You don't bloat your Postgres data files, but you still have a database for the documents instead of a file system. SQLite only looks to solve problems in a pretty specific set of cases but, as it turns out, those cases are also quite common. Combining it with a Postgres setup seems pretty exotic, but there might be a time when it makes sense. It depends on your setup.
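
A minimal sketch of that idea, using Python's built-in sqlite3 module; the file and table names are invented for illustration:

import sqlite3

con = sqlite3.connect("documents.db")   # one ordinary file holds everything
con.execute(
    """
    CREATE TABLE IF NOT EXISTS doc (
        id         INTEGER PRIMARY KEY,
        source_id  INTEGER NOT NULL,    -- PK of the related Postgres record
        name       TEXT    NOT NULL,
        body       BLOB    NOT NULL     -- the document itself
    )
    """
)
with con:                               # wraps the insert in a transaction
    con.execute(
        "INSERT INTO doc (source_id, name, body) VALUES (?, ?, ?)",
        (12345, "event-12345.xml", b"<log>...</log>"),
    )
con.close()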

On Sat, Apr 20, 2019 at 10:59 AM Jamesie Pic <jpic@yourlabs.org> wrote:
I forgot to mention that my deployments include automated migrations as often as possible, sometimes destructive ones for refactoring purposes. As such, to maintain PostgreSQL on a basic Linux box I am:

- for taking an automated backup beforehand in the automated deployment script that may run destructive migrations,
- against the needless overhead of coupling binary and relational data in operations, which slows the whole thing down or makes it less reliable

I also have what are arguably several more points against, mixed with more detail on the points briefly raised in my previous email, going deeper into how this fits into the big picture of my personal practice ... and how it destabilized my production environment for months:

tl;dr
If you store media files in PostgreSQL on a production server, then do take disk space alarms seriously even if they happen only during backups.
Otherwise I fail to see how to avoid a pattern of recurring incidents of "manually unblocking automated deployments" (a double interest payment on the debt, because it also defeats the purpose of automating deployments), if not outright filling up a disk during the nightly backup dump ...

Hope this helps,

Have a great day
