Thread: How to store "blobs" efficiently for small and large sizes, with random access
How to store "blobs" efficiently for small and large sizes, with random access
Hi. I'd like some advice storing blobs (millions of them), ranging from very small, to large > 1GB. I know about bytea versus lo, and I have probably read most of what's out there about them :) Upfront, I have to state that I'm not keen on lo, because of security considerations. We store blobs in many different schemas, and users can access some schemas, and not others. So the fact the lo table is unique for the whole database would allow users to see blobs from any schema, as I understand it. Right? OTOH, lo has random access, which I also need... (I'm also not a fan of lo needing triggers for lifetime management) (nor of the requirement to have an explicit transaction to use lo). Here are the main requirement I need to fulfil: 1) store literally millions of rows, 1 "blob" per row. (scientific data). 2) most "blobs" are smallish, below a few KBs. bytea's perfect for that. 3) yet many blobs are on the dozens of MBs. bytea's still OK. 4) but some blobs exceed the 1GB byte limit. Found at least a dozen just in our test data, so clients will have them. 5) for accessing larger blobs, the API I must implement accesses contiguous chunks of the blobs. Thus I need random access. I'm porting the backend of that API from Oracle to PostgreSQL. In Oracle we used RAW for smaller values, and SecureFile blobs for larger ones. Oracle LOBs are similar to lo I guess, providing random access, except higher performance for large sizes, but slower than bytea for smaller one. But a PITA to deal with, with locators, breaking pre-fetching, etc... PostgreSQL bytea is much better and simpler, except limited to 1GB... Bytea also has no direct random access, except via substr[ing], but how efficient and "random access" is that? For example, SQLite stores large values in overflow pages, and has to traverse all those pages to pointer-chase the chain of those for large blobs, so the substr() "random access" is NOT O(1) and involves way too much IO (i.e. O(N)). Do TOAST'ed large values (text or bytea) fare better in term of substr() efficiency, close to O(logN) and only the necessary IO for what's actually requested by the substr() (modulo the extrema 2K chunks of TOAST)? Some posts on the internet also peg lo to be faster above 20MB compared to bytea, for example. Is that even true? One post I read (from Daniel Verite I think) kinda hinted the lo table is not that different from TOAST ones, with 2K chunks, thus I don't see why lo would be any faster than a TOAST'ed bytea for example. Any insights from someone with knowledge of the internals can share on this? At this point, I have not yet redone the extensive benchmarking we did a few years ago between Oracle and PostgreSQL, which went into lo versus bytea then. And benchmark are hard to do, not having easy access to different kind of servers with different storage backends, or cloud-hosted PG. It's too easy to get biais from a local setup, leading to a design that'd not perform optimally in a different one. That's why I'm asking a more open-ended question to experts on this list. Because of #4 above, I need to either use lo (but see above, notably the security concern), or roll-up my own TOAST^2 (squared), as I call it, where I manually "shard" / chunk large blobs in an auxiliary table, itself TOAST'ed of course, with some threshold for chunks (e.g. 1 or 4 or 16 MB perhaps). The latter keeps the blobs in the schema (good for security), lifetime is managed by FKs (as usual), and the sharding limits the ill-effects of "emulating" random-access with substr() if necessary. 
I've already done things like this in SQLite land (also has the 1GB limit for its text and blob types). So is this a terrible idea? What alternatives do I have? I'd really appreciate some expert advice on the above, before I go too far down the rabbit hole. Thanks, --DD PS: Another drawback of lo is that because it's a single table, it still subject to the 32 TB limit on a relation. The 4TB limit per lo is way more than we need, but the 32 TB limit may actually be more of an issue for our larger clients, which have thousands of projects, each with upwards of a few millions of those blobs. bytea values being stored in different schemas (per-project, a design constraint), puts that limit per-project which will be more than enough. For the sum of all projects, maybe not... I.e. with real client-case of 3K projects, that puts an average of only 10GB of lo's per-project (i.e. schema), which could very well be problematic...
Sv: How to store "blobs" efficiently for small and large sizes, with random access
Hi. I'd like some advice storing blobs (millions of them), ranging
from very small, to large > 1GB.
I know about bytea versus lo, and I have probably read most of what's
out there about them :)
Upfront, I have to state that I'm not keen on lo, because of security
considerations. We store
blobs in many different schemas, and users can access some schemas,
and not others. So
the fact the lo table is unique for the whole database would allow
users to see blobs from any
schema, as I understand it. Right? OTOH, lo has random access, which I
also need...
(I'm also not a fan of lo needing triggers for lifetime management)
(nor of the requirement to have an explicit transaction to use lo).
Here are the main requirements I need to fulfil:
1) store literally millions of rows, 1 "blob" per row. (scientific data).
2) most "blobs" are smallish, below a few KBs. bytea's perfect for that.
3) yet many blobs are in the dozens of MBs. bytea's still OK.
4) but some blobs exceed the 1GB bytea limit. Found at least a dozen
just in our test data, so clients will have them.
5) for accessing larger blobs, the API I must implement accesses
contiguous chunks of the blobs. Thus I need random access.
I'm porting the backend of that API from Oracle to PostgreSQL.
In Oracle we used RAW for smaller values, and SecureFile blobs for larger ones.
Oracle LOBs are similar to lo I guess, providing random access, except
higher performance for large sizes,
but slower than bytea for smaller ones. But a PITA to deal with, with
locators, breaking pre-fetching, etc...
PostgreSQL bytea is much better and simpler, except limited to 1GB...
Bytea also has no direct random access, except via substr[ing], but
how efficient and "random access" is that?
For example, SQLite stores large values in overflow pages, and has to
traverse all those pages to pointer-chase
the chain of those for large blobs, so the substr() "random access" is
NOT O(1) and involves way too much IO (i.e. O(N)).
Do TOAST'ed large values (text or bytea) fare better in terms of
substr() efficiency, close to O(logN) and only
the necessary IO for what's actually requested by the substr() (modulo
TOAST's ~2KB chunk granularity)?
Some posts on the internet also peg lo to be faster above 20MB
compared to bytea, for example. Is that even true?
One post I read (from Daniel Verite I think) kinda hinted the lo table
is not that different from TOAST ones,
with 2K chunks, thus I don't see why lo would be any faster than a
TOAST'ed bytea for example.
Can someone with knowledge of the internals share any insights on this?
At this point, I have not yet redone the extensive benchmarking we did
a few years ago between
Oracle and PostgreSQL, which went into lo versus bytea then. And
benchmarks are hard to do, not
having easy access to different kinds of servers with different storage
backends, or cloud-hosted PG.
It's too easy to get bias from a local setup, leading to a design
that'd not perform optimally in a different one.
That's why I'm asking a more open-ended question to experts on this list.
Because of #4 above, I need to either use lo (but see above, notably
the security concern),
or roll my own TOAST^2 (squared), as I call it, where I manually
"shard" / chunk large blobs
in an auxiliary table, itself TOAST'ed of course, with some threshold
for chunks (e.g. 1 or 4 or 16 MB perhaps).
The latter keeps the blobs in the schema (good for security), lifetime
is managed by FKs (as usual),
and the sharding limits the ill-effects of "emulating" random-access
with substr() if necessary.
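A minimal sketch of that chunking idea (the table and column names, and
the 1 MB chunk size, are just placeholders for illustration):

  CREATE TABLE blob (
    blob_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    total_size bigint NOT NULL
  );

  CREATE TABLE blob_chunk (
    blob_id  bigint  NOT NULL REFERENCES blob ON DELETE CASCADE,  -- FK manages lifetime
    chunk_no integer NOT NULL,  -- 0-based, fixed-size 1 MB chunks
    chunk    bytea   NOT NULL,
    PRIMARY KEY (blob_id, chunk_no)
  );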
I've already done things like this in SQLite land (also has the 1GB
limit for its text and blob types).
So is this a terrible idea? What alternatives do I have?
I'd really appreciate some expert advice on the above, before I go too
far down the rabbit hole.
Thanks, --DD
PS: Another drawback of lo is that because it's a single table, it
is still subject to the 32 TB limit on a relation.
The 4TB limit per lo is way more than we need, but the 32 TB limit
may actually be more of an issue for
our larger clients, which have thousands of projects, each with
upwards of a few millions of those blobs.
bytea values being stored in different schemas (per-project, a
design constraint), puts that limit per-project
which will be more than enough. For the sum of all projects, maybe
not... I.e. with a real client case of 3K projects,
that leaves an average of only ~10GB of lo's per project (i.e. schema),
which could very well be problematic...
Re: How to store "blobs" efficiently for small and large sizes, with random access
First advice, don't do it. We started off storing blobs in DB for “TX safety”
, but backup/restore quickly became too cumbersome so we ended up moving all blobs out and only storing references in the DB.
This required us to make a “vacuum system” that cleans up the blob-storage regularly as ROLLBACK/crash can make it out of sync.
We chose storing as LO because with it, streaming large blobs (not using much memory) actually worked, with JDBC at least.
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> First advice, don't do it. We started off storing blobs in DB for “TX safety”,
> but backup/restore quickly became too cumbersome so we ended up moving all
> blobs out and only storing references in the DB.

Not really an option, I'm afraid.

> This required us to make a “vacuum system” that cleans up the blob-storage
> regularly as ROLLBACK/crash can make it out of sync.

Note the fact the data is spread in many mostly independent schemas.
Might that ease the backup/restore? I'm not much of a DBA though...

> We chose storing as LO because with it, streaming large blobs (not using
> much memory) actually worked, with JDBC at least.

I'm in C++, with I believe efficient use of binary binds and results, and use
of COPY as much as possible, so as good as it gets I think (that I know of,
with libpq) in terms of performance. Haven't looked at libpq's new-in-v14
pipeline mode yet though. Investigated Cursor vs Statement too, and it's a
tradeoff between latency and throughput. Not sure how good JDBC drivers for
PostgreSQL are. I'm doing my best to go fast with libpq.

In any case, thanks for your input. But it's not really a question of "if".
But of "how". Putting thousands of large blobs in the file system is a no-go.
Assuming the clients can even see the file system the server sees. This is a
2-tier system, there's no mid-tier that would somehow magically handle proper
security and lifetime management of these blobs. Thanks, --DD
Ok, just something to think about; Will your database grow beyond 10TB with blobs? If so try to calculate how long it takes to restore, and comply with SLA, and how long it would have taken to restore without the blobs.
PS: Our blobstore is not “the file system”, but SeaweedFS.
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wed, Oct 19, 2022 at 1:00 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> Ok, just something to think about;

Thank you. I do appreciate the feedback.

> Will your database grow beyond 10TB with blobs?

The largest internal store I've seen (for the subset of data that goes in the
DB) is shy of 3TB. But we are an ISV, not one of our clients, which have truly
massive scale for data. And they don't share the exact scale of their
proprietary data with me...

> If so try to calculate how long it takes to restore, and comply with SLA,
> and how long it would have taken to restore without the blobs.

Something I don't quite get is why somehow backup is no longer needed if the
large blobs are external? I.e. are you saying backups are so much worse in
PostgreSQL than with the FS? I'm curious now.

Also, managing the PostgreSQL server will be the client's own concern mostly.
We are not into SaaS here. As hinted above, the truly massive data is already
not in the DB, used by different systems, and processed down to the GB-sized
inputs all the data put in the DB is generated from. It's a scientific
data-heavy environment. And one where security of the data is paramount, for
contractual and legal reasons. Files make that harder IMHO.

Anyways, this is straying from the main theme of this post I'm afraid.
Hopefully we can come back on the main one too. --DD
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wed, Oct 19, 2022 at 1:00 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> Ok, just something to think about;
Thank you. I do appreciate the feedback.
> Will your database grow beyond 10TB with blobs?
The largest internal store I've seen (for the subset of data that goes
in the DB) is shy of 3TB.
But we are an ISV, not one of our clients, which have truly massive
scale for data.
And they don't share the exact scale of their proprietary data with me...
> If so try to calculate how long it takes to restore, and comply with SLA,
> and how long it would have taken to restore without the blobs.
Something I don't quite get is why somehow backup is no longer needed
if the large blobs are external?
i.e. are you saying backups are so much more worse in PostgreSQL than
with the FS? I'm curious now.
I'm not saying you don't need backup (or redundancy) of other systems holding blobs, but moving them out of the RDBMS lets you restore the DB to a consistent state, and be able to serve clients, faster. In my experience it's quite unlikely that your (redundant) blob-store needs crash-recovery at the same time your DB does. The same goes for PITR, needed because of some logical error (like a client deleted some data they shouldn't have), which is much faster without blobs in the DB and doesn't affect the blobstore at all (if you have a smart insert/update/delete policy there).
Also, managing the PostgreSQL server will be the client's own concern
mostly. We are not into Saas here.
As hinted above, the truly massive data is already not in the DB, used
by different systems, and processed
down to the GB sized inputs all the data put in the DB is generated
from. It's a scientific data heavy environment.
And one where security of the data is paramount, for contractual and
legal reasons. Files make that harder IMHO.
Anyways, this is straying from the main theme of this post I'm afraid.
Hopefully we can come back on the main one too. --DD
There's a reason “everybody” advises moving blobs out of the DB, I've learned.
Re: How to store "blobs" efficiently for small and large sizes, with random access
There's a reason “everybody” advises moving blobs out of the DB, I've learned.
Re: How to store "blobs" efficiently for small and large sizes, with random access
On 19 Oct 2022, at 7:01 PM, Dominique Devienne <ddevienne@gmail.com> wrote:
On Wed, Oct 19, 2022 at 1:38 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> There's a reason “everybody” advises moving blobs out of the DB, I've learned.

I get that. I really do. But the alternative has some real downsides too.
Especially around security, as I already mentioned. That's why I'd like, if
possible, to get input on the technical questions of my initial post.

That's not to say we wouldn't ultimately move the big blobs outside the DB.
But given how much that would complexify the project, I do believe it is
better to do it as a second step, once the full system is up-and-running and
testing at scale has actually been performed.

We've already moved other kinds of data to PostgreSQL, from SQLite DBs
(thousands) this time, and ported "as-is" the sharding done on the SQLite side
to PostgreSQL (despite TOAST). And so far, so good. With good ingestion rates.
And decent runtime access to data too, in the albeit limited testing we've had
so far. Now we need to move this other kind of data, from proprietary DB-like
files this time (thousands too), to finish our system, and be able to finally
test the whole system in earnest, and at (our limited internal) scale.

So you see, I'm not completely ignoring your advice. But for now, I'm
inquiring as to the *best* way to put that data *in* PostgreSQL, with the
requirements / constraints I've listed in the first post. It may indeed be a
bad idea long term. But let's make the most of it for now. Makes sense? Am I
being unreasonable here? --DD
Re: How to store "blobs" efficiently for small and large sizes, with random access
Dominique Devienne wrote:

> the fact the lo table is unique for the whole database would allow
> users to see blobs from any schema, as I understand it.

Direct access to pg_largeobject is only possible for superusers. If
lo_compat_privileges is on, any user can read any large object with the lo*
functions. If it's off, they can read a large object only if they're the owner
or they have been granted permissions with:

  GRANT { { SELECT | UPDATE } [, ...] | ALL [ PRIVILEGES ] }
    ON LARGE OBJECT loid [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]
    [ GRANTED BY role_specification ]

Each large object has its own set of permissions. This is a significant
difference with bytea, since every creation of a new large object may need to
be followed by GRANT statements. Also, if the roles and the access policies
are changed in the lifetime of the app, that might imply massive REVOKE/GRANT
statements to apply to existing objects.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
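For instance, a minimal sketch of what per-object grants could look like (the
role name and the returned OID are made up for illustration; lo_from_bytea()
creates a large object from a bytea and returns its OID):

  -- create a large object and note its OID (say it returns 24601)
  SELECT lo_from_bytea(0, '\x0102030405');
  -- grant read access on that one object to an assumed role
  GRANT SELECT ON LARGE OBJECT 24601 TO project_a_reader;
  -- and later, if access policies change, per-object revokes are needed too
  REVOKE SELECT ON LARGE OBJECT 24601 FROM project_a_reader;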
Re: How to store "blobs" efficiently for small and large sizes, with random access
On 2022-Oct-19, Dominique Devienne wrote:

> Upfront, I have to state that I'm not keen on lo, because of security
> considerations. We store blobs in many different schemas, and users
> can access some schemas, and not others. So the fact the lo table is
> unique for the whole database would allow users to see blobs from any
> schema, as I understand it. Right? OTOH, lo has random access, which I
> also need...

Generally speaking, bytea sucks for random access, because if a TOAST item is
compressed, it has to be always read from the beginning in order to decompress
correctly. However, if you set

  ALTER TABLE tab ALTER COLUMN byt SET STORAGE EXTERNAL

then compression is not used, and random access becomes fast.

https://www.postgresql.org/docs/15/sql-altertable.html

--
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir, la suma de
experiencias de que un hombre es capaz"
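As a concrete illustration (a sketch only, assuming a table blobs(id bigint,
data bytea); note that SET STORAGE only affects values stored after the
change):

  ALTER TABLE blobs ALTER COLUMN data SET STORAGE EXTERNAL;

  -- with uncompressed (EXTERNAL) storage, a slice read like this should
  -- only fetch the TOAST chunks covering the requested byte range
  SELECT substr(data, 1048577, 65536)  -- 64 KB starting at offset 1 MB (1-based)
  FROM blobs
  WHERE id = 42;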
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wed, Oct 19, 2022 at 3:01 PM Daniel Verite <daniel@manitou-mail.org> wrote:
> Dominique Devienne wrote:
> > the fact the lo table is unique for the whole database would allow
> > users to see blobs from any schema, as I understand it.
> Each large object has its own set of permissions. This is a significant
> difference with bytea, since every creation of a new large object
> may need to be followed by GRANT statements.
> Also if the roles and the access policies are changed in the
> lifetime of the app, that might imply massive REVOKE/GRANT
> statements to apply to existing objects.

Thank you Daniel. Very interesting, and something I definitely didn't know.
I believe that's doable, given our design on ROLEs, but would for sure be both
a PITA, and additional management / code to deal with. At least GRANTs are
transactional like the new LO oids themselves, I think, so now I know it would
be possible to properly secure the LOs. This insight is greatly appreciated.
--DD
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wed, Oct 19, 2022 at 3:05 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Oct-19, Dominique Devienne wrote:
> > OTOH, lo has random access, which I also need...
>
> Generally speaking, bytea sucks for random access, because if a TOAST
> item is compressed, it has to be always read from the beginning in order
> to decompress correctly. However, if you set
>   ALTER TABLE tab ALTER COLUMN byt SET STORAGE EXTERNAL
> then compression is not used, and random access becomes fast.

Thank you Álvaro. Quite insightful as well. I was aware of TOAST compression,
but didn't make the connection to the side-effects on random-access. But now
that TOAST has LZ4 support, which decompresses extremely fast compared to ZLib
(I have experience with LZ4 for WebSocket-based server messages), and choosing
an appropriately small shard/chunk size, that might be mitigated somewhat.
Would need testing / benchmarking to compare uncompressed vs LZ4, at various
chunk and subset/offset sizes, of course.

Anybody have an answer to my question regarding how substr() works on bytea
values? I.e. is it "pushed down" / optimized enough that it avoids reading the
whole N-byte value, to then pass it to substr(), which then returns an M-byte
value (where M < N)? If TOAST stores 2,000-byte chunks, and those chunks' PKs
are numbers, simple arithmetic should be able to select only the chunks of
interest, thus incurring only the necessary IO for the selected range, no? Or
does the fact that subsetting a bytea currently requires substr() prevent such
a scenario? And if so, why not support a native subsetting notation that does
support it, like the obvious bytea_col[offset, count] or bytea_col[start:end]?
Seems to me efficient native subsetting of varlength values would be quite
valuable.
Re: How to store "blobs" efficiently for small and large sizes, with random access
Dominique Devienne wrote:

> PostgreSQL bytea is much better and simpler, except limited to 1GB...
> Bytea also has no direct random access, except via substr[ing], but
> how efficient and "random access" is that?

Bytea contents are compressed before being sliced (in chunks of
TOAST_MAX_CHUNK_SIZE bytes, typically it's 2000 IIRC), so it's not possible to
access a piece of data without decompressing the contents before it.

By contrast large objects are sliced before compression, so the performance of
random access is likely to be completely different.

> Here are the main requirement I need to fulfil:
> 1) store literally millions of rows, 1 "blob" per row. (scientific data).
> 2) most "blobs" are smallish, below a few KBs. bytea's perfect for that.
> 3) yet many blobs are on the dozens of MBs. bytea's still OK.
> 4) but some blobs exceed the 1GB byte limit. Found at least a dozen
> just in our test data, so clients will have them.
> 5) for accessing larger blobs, the API I must implement accesses
> contiguous chunks of the blobs. Thus I need random access.

In your case I would probably opt for bytea (as opposed to large objects), and
slicing the blobs in the application in chunks of a fixed size much larger
than what TOAST does (for instance, somewhere between 128 kB and 32 MB).

That is, having a table like:

  create table blobs (
    blob_id some_type,
    chunk_no int,  /* 0->N */
    chunk bytea
  );

It's not as simple as using a single bytea field or large objects, but overall
it avoids the management difficulties of both large objects and very large
contents in bytea columns. Random access is achieved by skipping the chunks
before the requested piece of data. If the app is able to read/write the
chunks in binary mode, its queries should perform as well as the large objects
functions.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
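As a rough sketch of the read side (placeholders throughout: :blob_id, :off
and :len are application-supplied values, and 1048576 is an assumed fixed
1 MB chunk size), a random-access read of :len bytes at 0-based offset :off
could fetch only the covering chunks and trim the first/last chunk
client-side (or with substr()):

  SELECT chunk_no, chunk
  FROM   blobs
  WHERE  blob_id = :blob_id
  AND    chunk_no BETWEEN  :off / 1048576
                      AND (:off + :len - 1) / 1048576
  ORDER  BY chunk_no;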
On 10/19/22 04:47, Dominique Devienne wrote:
[snip]
> PS: Another drawback of lo is that because it's a single table, it
> still subject to the 32 TB limit on a relation.
> The 4TB limit per lo is way more than we need, but the 32 TB limit
> may actually be more of an issue for our larger clients, which have
> thousands of projects, each with upwards of a few millions of those blobs.

Partition the table on the first segment of the Primary Key. From experience,
anything else can lead to serious query degradation.

--
Angular momentum makes the world go 'round.
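One possible reading of that suggestion, applied to the chunked table sketched
earlier and using hash partitioning on the leading primary-key column (all
names, and the 8-way split, are arbitrary placeholders):

  CREATE TABLE blobs (
    blob_id  bigint  NOT NULL,
    chunk_no integer NOT NULL,
    chunk    bytea   NOT NULL,
    PRIMARY KEY (blob_id, chunk_no)
  ) PARTITION BY HASH (blob_id);

  CREATE TABLE blobs_p0 PARTITION OF blobs FOR VALUES WITH (MODULUS 8, REMAINDER 0);
  CREATE TABLE blobs_p1 PARTITION OF blobs FOR VALUES WITH (MODULUS 8, REMAINDER 1);
  -- ... and so on, up to REMAINDER 7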
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wednesday, October 19, 2022 at 13:21:38, Dominique Devienne <ddevienne@gmail.com> wrote:
On Wed, Oct 19, 2022 at 1:00 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> Ok, just something to think about;
Thank you. I do appreciate the feedback.
> Will your database grow beyond 10TB with blobs?
The largest internal store I've seen (for the subset of data that goes
in the DB) is shy of 3TB.
But we are an ISV, not one of our clients, which have truly massive
scale for data.
And they don't share the exact scale of their proprietary data with me...
> If so try to calculate how long it takes to restore, and comply with SLA,
> and how long it would have taken to restore without the blobs.
Something I don't quite get is why somehow backup is no longer needed
if the large blobs are external?
i.e. are you saying backups are so much more worse in PostgreSQL than
with the FS? I'm curious now.

I'm not saying you don't need backup (or redundancy) of other systems holding
blobs, but moving them out of the RDBMS lets you restore the DB to a
consistent state, and be able to serve clients, faster. In my experience it's
quite unlikely that your (redundant) blob-store needs crash-recovery at the
same time your DB does. The same goes for PITR, needed because of some logical
error (like a client deleted some data they shouldn't have), which is much
faster without blobs in the DB and doesn't affect the blobstore at all (if you
have a smart insert/update/delete policy there).

This is nothing to sneeze at. Backing up a 30TB database takes a long time.
Also, managing the PostgreSQL server will be the client's own concern
mostly. We are not into Saas here.
As hinted above, the truly massive data is already not in the DB, used
by different systems, and processed
down to the GB sized inputs all the data put in the DB is generated
from. It's a scientific data heavy environment.
And one where security of the data is paramount, for contractual and
legal reasons. Files make that harder IMHO.
Anyways, this is straying from the main theme of this post I'm afraid.
Hopefully we can come back on the main one too. --DD

There's a reason “everybody” advises moving blobs out of the DB, I've learned.
We deal with an ISV maintaining a banking application. It stores scanned
images of checks as bytea fields in a PostgreSQL 9.6 database. The next
version will store the images outside of the database.

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Angular momentum makes the world go 'round.
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wed, 2022-10-19 at 12:48 +0200, Dominique Devienne wrote:
> On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> > First advice, don't do it. We started off storing blobs in DB for “TX safety”
>
> Not really an option, I'm afraid.

You should reconsider. Ruling out that option now might get you into trouble
later. Large Objects mean trouble.

Yours,
Laurenz Albe
Re: How to store "blobs" efficiently for small and large sizes, with random access
On 2022-Oct-19, Dominique Devienne wrote:
> Anybody has an answer to my question regarding how substr() works on
> bytea values? I.e. is it "pushed down" / optimized enough that it
> avoids reading the whole N-byte value, to then pass it to substr(),
> which then returns an M-byte value (where M < N)?
>
> If TOAST stores 2,000 chunks, and those chunks' PKs are numbers,
> simple arithmetic should be able to select only the chunks of
> interest, those incurring only the necessary IO for the selected
> range, no?

That's exactly what I was trying to say. If there's no compression, we don't
read prior chunks. (This is valid for bytea, at least; for textual types we
have to worry about multibyte characters, which are again a potential source
of confusion regarding the exact location you want to seek.)

This can be seen in detoast_attr_slice() in
src/backend/access/common/detoast.c, though there are way too many^W^W^W
multiple layers of indirection if you start from bytea_substr() in varlena.c.

--
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
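On a related note, a quick way to check whether a stored bytea value is
compressed at all (and therefore whether slice reads can skip prior chunks) is
pg_column_compression(), available in PostgreSQL 14 and later; the table and
column names below are placeholders:

  SELECT pg_column_compression(data) AS compression,   -- NULL means not compressed
         pg_column_size(data)        AS stored_bytes,  -- on-disk (possibly compressed) size
         octet_length(data)          AS raw_bytes      -- logical size
  FROM   blobs
  WHERE  id = 42;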
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wed, Oct 19, 2022 at 4:29 PM Daniel Verite <daniel@manitou-mail.org> wrote:
> In your case I would probably opt for bytea (as opposed to large
> objects), and slicing the blobs in the application in chunks

Thanks for the advice, and the valuable info on LO permissions. --DD
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wed, Oct 19, 2022 at 5:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> That's exactly what I was trying to say.
> If there's no compression, we don't read prior chunks.

Great to read that. I'll probably try to benchmark w/ and w/o compression
eventually. Need to deal with other issues first, will take a while to report
back on this.

> This can be seen in detoast_attr_slice()

Thank you for the pointer. I'll be sure to have a look at that code. And
thanks again for chiming in with very useful info, Alvaro.
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wed, Oct 19, 2022 at 5:05 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Wed, 2022-10-19 at 12:48 +0200, Dominique Devienne wrote:
> > On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> > > First advice, don't do it. We started off storing blobs in DB for “TX safety”
> > Not really an option, I'm afraid.
> You should reconsider. Ruling out that option now might get you into trouble
> later. Large Objects mean trouble.

Andreas, Ericson, Laurenz, thanks for the advice. I'll be sure to discuss
these concerns with the team. We have other (bigger) data in the file system,
albeit of a more read-only nature perhaps. And this is an area where I'm not
familiar with how security is handled, so I'll investigate it to see if a path
forward to externalize the largish blobs (currently destined to live in the
DB) is possible. So I hope you can see I'm not dismissing what you guys are
saying.

But before I finish this thread for now, I'd like to add that I consider it
unfortunate that NOT putting the data in the DB is the mostly agreed-upon
advice. It IMHO points to a weak point of PostgreSQL, which does not invest in
those use-cases with large data, perhaps with more file-system-like
techniques. Probably because most of the large users of PostgreSQL are more on
the "business" side (numerous data, but on the smaller sizes) than the
"scientific" side, which (too often) uses files and files-in-a-file formats
like HDF5.

FWIW, when Oracle introduced SecureFile blobs years ago in v11, it represented
a leap forward in performance, and back then we were seeing them being 3x
faster than LO at GB sizes, if I recall correctly, with throughput that
challenged regular networked file-systems like NFS. That was over 10 years
ago, so who knows where we are now. And from the posts here, the issues with
large blobs may be more related to backup/restore than runtime performance.

Having all the data in the DB, under a single security model, is a big win for
consistency and simplicity. And the fact it's not really possible now is a
pity, in my mind. My (probably uninformed) opinion on this is that large blobs
are handled just like other relational data, in paged storage designed for
smaller data. I.e. file-like blobs are shoehorned into structures which are
inappropriate for them, and a rethink and redesign is necessary specifically
for them, similar to the Oracle SecureFile one of old.

I have similar gripes with SQLite, which is otherwise a fantastic embedded DB.
Just see how the SQLite-based Fossil-SCM fails to scale for very large repos
with big (e.g. game) assets, and how it similarly failed to scale in SVN a
long time ago, to be replaced by a forest-of-files (which Git also uses). DBs
like PostgreSQL and SQLite should be better at this. And I hope they get there
eventually.

Sorry to turn a bit philosophical at this point. It's not a critique per se.
More the personal musings of a dev in this space for a long time. FWIW.
Thanks, --DD
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Wed, Oct 19, 2022 at 5:05 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Wed, 2022-10-19 at 12:48 +0200, Dominique Devienne wrote:
> > On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> > > First advice, don't do it. We started off storing blobs in DB for “TX safety”
> > Not really an option, I'm afraid.
> You should reconsider. Ruling out that option now might get you into trouble
> later. Large Objects mean trouble.
Andreas, Ericson, Laurenz, thanks for the advice.
I'll be sure to discuss these concerns with the team.
[…]
But before I finish this thread for now, I'd like to add that I
consider unfortunate a state of affairs where
NOT putting the data in the DB is the mostly agreed upon advice. It
IMHO points to a weak point of
PostgreSQL, which does not invest in those use-cases with large data,
perhaps with more file-system
like techniques. Probably because most of the large users of
PostgreSQL are more on the "business"
side (numerous data, but on the smaller sizes) than the "scientific"
side, which (too often) uses files and
files-in-a-file formats like HDF5.
[…]
Attachment
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Thu, 2022-10-20 at 10:32 +0200, Dominique Devienne wrote:
> But before I finish this thread for now, I'd like to add that I
> consider unfortunate a state of affairs where
> NOT putting the data in the DB is the mostly agreed upon advice. It
> IMHO points to a weak point of
> PostgreSQL, which does not invest in those use-cases with large data,
> perhaps with more file-system
> like techniques.

Relational databases are general-purpose software, but that doesn't mean that
they are the right tool for everything. And storing large files is not what
they are good at. File systems are the right tool here.

> FWIW, when Oracle introduced SecureFile blobs years ago in v11, it
> represented a leap forward in
> performance, and back then we were seeing them being 3x faster than LO
> at GB sizes, if I recall correctly,

I don't know what exactly they are, but I suspect that they are just files
(segments?) in Oracle's "file system" (tablespaces/datafiles). So pretty much
what we recommend.

Yours,
Laurenz Albe
Aw: Re: How to store "blobs" efficiently for small and large sizes, with random access
> I don't know what exactly they are, but I suspect that they are just
> files (segments?) in Oracle's "file system" (tablespaces/datafiles).
> So pretty much what we recommend.

Maybe so, but if those large segments are presented "seamlessly" in the form
of a table integrated with PG's access/security infrastructure, that would be
really helpful for some scenarios. A view-on-top-of-file_fdw kind of thing?
LO seems to nearly be there by now, or am I misunderstanding?

Karsten
On 10/20/22 03:32, Dominique Devienne wrote:
[snip]
> And from the posts here, the issues with large blobs may be more
> related to backup/restore perhaps, than runtime performance.

From my long experience as first a programmer and then a database
administrator, backups and archiving are at the very bottom of programmers'
priority list, whereas they're near the top of a DBA's priority list.

> Having all the data in the DB, under a single security model, is a big
> win for consistency and simplicity.

No doubt.

--
Angular momentum makes the world go 'round.
Re: How to store "blobs" efficiently for small and large sizes, with random access
On Thu, Oct 20, 2022 at 12:21 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Thu, 2022-10-20 at 10:32 +0200, Dominique Devienne wrote:
> > FWIW, when Oracle introduced SecureFile blobs years ago in v11, it
> > represented a leap forward in
> > performance, and back then we were seeing them being 3x faster than LO
> > at GB sizes, if I recall correctly,
>
> I don't know what exactly they are, but I suspect that they are just
> files (segments?) in Oracle's "file system" (tablespaces/datafiles).
> So pretty much what we recommend.

I don't think so :) Pretty much the opposite in fact. Don't let the term
SecureFile fool you. From the DB client's perspective, they are a server-side
value like any other, living in a row/col (relation/tuple), pretty much like a
BYTEA value. But unlike BYTEA, and like LO, what you SELECT or UPDATE is a LOB
locator, not the content itself, then used in separate APIs, so very much like
LO. So you get the best of both BYTEA (acts like a value, in a tab/col) and LO
(random access). Fully transactional.

In fact, the LOB locator is a handle into the MVCC machinery, so you can get
the locator(s) within a transaction, commit that transaction, and later if you
read from the locator, you are still AS OF that transaction (i.e.
read-consistent). That's super handy for lazy-loading the data in the app on
demand. You of course expose yourself to "snapshot too old". We missed that
dearly in PostgreSQL. Could emulate it with an explicit SNAPSHOT left open,
not nearly as convenient.

And there are specialized APIs that allow operating on *multiple* LOBs in a
single server round-trip, which was essential for performance for smaller
ones. In Oracle, anything above 32KB had to be a LOB (at the time), yet you
don't want to do a round-trip for each and every 32KB chunk of data, when you
have thousands to load (unless you shard yourself "manually", but good luck
matching the perf of SecureFile LOBs).

I'm not privy to how they were implemented internally. But I do believe the
old blobs they were replacing (at the time) were more like LO is, i.e. handled
(mostly) like the other datatypes, in the table infrastructure; while these
new (in v11) blobs were handled internally completely differently, more in a
file-system manner. Thus the large performance gap between the OLD and NEW
Oracle LOBs. But from the outside, that's an "implementation detail". They
were in the DB, transactional, value-like (modulo the level of indirection for
random access), and importantly, efficient. I really wish PostgreSQL had an
equivalent.

There's apparently an Oracle ACE on this list, so you can fill in the gaps, or
correct any falsehoods I wrote above. This is over 10 years old, so.... I was
a big fan of another ACE, Tom Kyte, whose books helped me a lot, and I was
neck-deep in OCI for a few years, but I was just a mostly-self-taught Oracle
developer, so definitely not an expert like an ACE. FWIW, --DD