Thread: How to store "blobs" efficiently for small and large sizes, with random access

How to store "blobs" efficiently for small and large sizes, with random access

From
Dominique Devienne
Date:
Hi. I'd like some advice on storing blobs (millions of them), ranging from
very small to large (> 1GB).

I know about bytea versus lo, and I have probably read most of what's
out there about them :)

Upfront, I have to state that I'm not keen on lo, because of security
considerations. We store blobs in many different schemas, and users can
access some schemas and not others. So the fact that the lo table is unique
for the whole database would allow users to see blobs from any schema, as I
understand it. Right? OTOH, lo has random access, which I also need...
(I'm also not a fan of lo needing triggers for lifetime management)
(nor of the requirement to have an explicit transaction to use lo).
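(For reference, by "triggers for lifetime management" I mean roughly what the
contrib "lo" module documents; a sketch along the lines of its example:)

CREATE EXTENSION lo;
CREATE TABLE image (title text, raster lo);
-- Unlink the underlying large object when the referencing row goes away:
CREATE TRIGGER t_raster BEFORE UPDATE OR DELETE ON image
    FOR EACH ROW EXECUTE FUNCTION lo_manage(raster);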

Here are the main requirements I need to fulfil:
1) store literally millions of rows, 1 "blob" per row. (scientific data).
2) most "blobs" are smallish, below a few KBs. bytea's perfect for that.
3) yet many blobs are in the dozens of MBs. bytea's still OK.
4) but some blobs exceed bytea's 1GB limit. Found at least a dozen just in
our test data, so clients will have them.
5) for accessing larger blobs, the API I must implement accesses
contiguous chunks of the blobs. Thus I need random access.

I'm porting the backend of that API from Oracle to PostgreSQL.
In Oracle we used RAW for smaller values, and SecureFile blobs for larger ones.
Oracle LOBs are similar to lo I guess, providing random access, except with
higher performance at large sizes, but slower than bytea for smaller ones.
But a PITA to deal with, with locators, breaking pre-fetching, etc...

PostgreSQL bytea is much better and simpler, except limited to 1GB...
Bytea also has no direct random access, except via substr[ing], but
how efficient and "random access" is that?
For example, SQLite stores large values in overflow pages, and has to
pointer-chase the chain of those pages for large blobs, so its substr()
"random access" is NOT O(1) and involves way too much IO (i.e. O(N)).
Do TOAST'ed large values (text or bytea) fare better in terms of substr()
efficiency, close to O(logN), with only the necessary IO for what's actually
requested by the substr() (modulo the granularity of TOAST's ~2K chunks)?

Some posts on the internet also peg lo to be faster above 20MB
compared to bytea, for example. Is that even true?
One post I read (from Daniel Verite, I think) kinda hinted that the lo
table is not that different from TOAST ones, with 2K chunks, so I don't see
why lo would be any faster than a TOAST'ed bytea, for example.
Any insights someone with knowledge of the internals can share on this?

At this point, I have not yet redone the extensive benchmarking we did a
few years ago between Oracle and PostgreSQL, which went into lo versus bytea
then. And benchmarks are hard to do, not having easy access to different
kinds of servers with different storage backends, or cloud-hosted PG.
It's too easy to get bias from a local setup, leading to a design that'd not
perform optimally in a different one.
That's why I'm asking a more open-ended question to experts on this list.

Because of #4 above, I need to either use lo (but see above, notably the
security concern), or roll my own TOAST^2 (squared), as I call it, where I
manually "shard" / chunk large blobs in an auxiliary table, itself TOAST'ed
of course, with some threshold for chunks (e.g. 1 or 4 or 16 MB perhaps);
see the sketch below.
The latter keeps the blobs in the schema (good for security), lifetime is
managed by FKs (as usual), and the sharding limits the ill effects of
"emulating" random access with substr() if necessary.
I've already done things like this in SQLite land (which also has the 1GB
limit for its text and blob types).
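(Roughly what I have in mind, as a sketch; all names and the chunk size are
hypothetical:)

-- Small contents stay inline in the main row, large ones are split into
-- fixed-size chunks whose lifetime follows the parent row via the FK.
CREATE TABLE blob (
    blob_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    small    bytea                  -- NULL when the content lives in blob_chunk
);

CREATE TABLE blob_chunk (
    blob_id  bigint NOT NULL REFERENCES blob ON DELETE CASCADE,
    chunk_no int    NOT NULL,       -- 0..N, e.g. 4 MB per chunk
    chunk    bytea  NOT NULL,
    PRIMARY KEY (blob_id, chunk_no)
);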

So is this a terrible idea? What alternatives do I have?
I'd really appreciate some expert advice on the above, before I go too
far down the rabbit hole.

Thanks, --DD

PS: Another drawback of lo is that because it's a single table, it's still
subject to the 32 TB limit on a relation.
  The 4TB limit per lo is way more than we need, but the 32 TB limit may
actually be more of an issue for our larger clients, which have thousands of
projects, each with upwards of a few million of those blobs.
  bytea values being stored in different schemas (per-project, a design
constraint) puts that limit per-project, which will be more than enough. For
the sum of all projects, maybe not... I.e. with a real client case of 3K
projects, that leaves an average of only ~10GB of lo's per project (i.e.
schema), which could very well be problematic...



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Andreas Joseph Krogh
Date:
First advice: don't do it. We started off storing blobs in the DB for “TX safety”, but backup/restore quickly became too cumbersome, so we ended up moving all blobs out and only storing references in the DB. This required us to build a “vacuum system” that cleans up the blob storage regularly, as a ROLLBACK/crash can make it go out of sync.
 
We chose storing as LO because with it, streaming large blobs (not using much memory) actually worked, with JDBC at least.
 
On Wednesday 19 October 2022 at 11:47:59, Dominique Devienne <ddevienne@gmail.com> wrote:
[Dominique's original message quoted in full; trimmed — see the top of the thread.]
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 

Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Dominique Devienne
Date:
On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
First advice, don't do it. We started off storing blobs in DB for “TX safety”

Not really an option, I'm afraid.
 
, but backup/restore quickly became too cumbersome so we ended up moving all blobs out and only store reference in DB.
This required us to make a “vacuum system” that cleans up the blob-storage regularly as ROLLBACK/crash can make it out of sync.

Note that the data is spread across many mostly independent schemas.
Might that ease the backup/restore? I'm not much of a DBA though...
 
We chose storing as LO because with it, streaming large blobs (not using much memory) actually worked, with JDBC at least.

I'm in C++, with (I believe) efficient use of binary binds and results, and
use of COPY as much as possible, so about as good as it gets (that I know
of, with libpq) in terms of performance. Haven't looked at libpq's
new-in-v14 pipeline mode yet, though.

Investigated Cursor vs Statement too, and it's a tradeoff between latency and throughput.
Not sure how good JDBC drivers for PostgreSQL are. I'm doing my best to go fast with libpq.

In any case, thanks for your input. But it's not really a question of "if". But of "how".

Putting thousands of large blobs in the file system is a no-go, assuming
the clients can even see the file system the server sees. This is a 2-tier
system; there's no mid-tier that would somehow magically handle proper
security and lifetime management of these blobs.

Thanks, --DD

Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Andreas Joseph Krogh
Date:
On Wednesday 19 October 2022 at 12:48:24, Dominique Devienne <ddevienne@gmail.com> wrote:
[Previous exchange quoted in full; trimmed.]

Ok, just something to think about: will your database grow beyond 10TB with blobs? If so, try to calculate how long it takes to restore (and still comply with your SLA), versus how long it would have taken to restore without the blobs.

 

PS: Our blobstore is not “the file system”, but SeaweedFS.

 

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 

Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Dominique Devienne
Date:
On Wed, Oct 19, 2022 at 1:00 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> Ok, just something to think about;

Thank you. I do appreciate the feedback.

> Will your database grow beyond 10TB with blobs?

The largest internal store I've seen (for the subset of data that goes in
the DB) is shy of 3TB.
But we are an ISV, not one of our clients, which operate at truly massive
data scale.
And they don't share the exact scale of their proprietary data with me...

> If so try to calculate how long it takes to restore, and comply with SLA,
> and how long it would have taken to restore without the blobs.

Something I don't quite get is why backup is somehow no longer needed if
the large blobs are external?
I.e. are you saying backups are so much worse in PostgreSQL than with the
FS? I'm curious now.

Also, managing the PostgreSQL server will mostly be the client's own
concern. We are not into SaaS here.
As hinted above, the truly massive data is already not in the DB; it is used
by different systems, and processed down to the GB-sized inputs from which
all the data put in the DB is generated. It's a scientific-data-heavy
environment.
And one where security of the data is paramount, for contractual and legal
reasons. Files make that harder IMHO.

Anyways, this is straying from the main theme of this post, I'm afraid.
Hopefully we can come back to the main one too. --DD



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Andreas Joseph Krogh
Date:
On Wednesday 19 October 2022 at 13:21:38, Dominique Devienne <ddevienne@gmail.com> wrote:
On Wed, Oct 19, 2022 at 1:00 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> Ok, just something to think about;

Thank you. I do appreciate the feedback.

> Will your database grow beyond 10TB with blobs?

The largest internal store I've seen (for the subset of data that goes
in the DB) is shy of 3TB.
But we are an ISV, not one of our clients, which have truly massive
scale for data.
And they don't share the exact scale of their proprietary data with me...

> If so try to calculate how long it takes to restore, and comply with SLA,
> and how long it would have taken to restore without the blobs.

Something I don't quite get is why somehow backup is no longer needed
if the large blobs are external?
i.e. are you saying backups are so much more worse in PostgreSQL than
with the FS? I'm curious now.

I'm not saying you don't need backup (or redundancy) of other systems holding blobs, but moving them out of the RDBMS lets you restore the DB to a consistent state, and be able to serve clients again, faster. In my experience it's quite unlikely that your (redundant) blob store needs crash recovery at the same time your DB does. The same goes for PITR, needed because of some logical error (like a client deleting some data they shouldn't have), which is much faster without blobs in the DB and doesn't affect the blob store at all (if you have a smart insert/update/delete policy there).

 

Also, managing the PostgreSQL server will be the client's own concern
mostly. We are not into Saas here.
As hinted above, the truly massive data is already not in the DB, used
by different systems, and processed
down to the GB sized inputs all the data put in the DB is generated
from. It's a scientific data heavy environment.
And one where security of the data is paramount, for contractual and
legal reasons. Files make that harder IMHO.

Anyways, this is straying from the main theme of this post I'm afraid.
Hopefully we can come back on the main one too. --DD

There's a reason “everybody” advises moving blobs out of the DB, I've learned.

 

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 

Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Dominique Devienne
Date:
On Wed, Oct 19, 2022 at 1:38 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
There's a reason “everybody” advises moving blobs out of the DB, I've learned.

I get that. I really do. But the alternative has some real downsides too,
especially around security, as I already mentioned. That's why I'd like, if
possible, to get input on the technical questions of my initial post.

That's not to say we wouldn't ultimately move the big blobs out of the DB.
But given how much that would complicate the project, I do believe it is
better to do it as a second step, once the full system is up and running and
testing at scale has actually been performed.

We've already moved other kinds of data to PostgreSQL, from SQLite DBs
(thousands of them) that time, and ported "as-is" the sharding done on the
SQLite side to PostgreSQL (despite TOAST).
And so far, so good. With good ingestion rates. And decent runtime access to
the data too, in the albeit limited testing we've had so far.

Now we need to move this other kind of data, from proprietary DB-like files
this time (thousands too), to finish our system, and be able to finally test
the whole system in earnest, and at (our limited internal) scale.

So you see, I'm not completely ignoring your advice.

But for now, I'm inquiring as to the *best* way to put that data *in* PostgreSQL,
with the requirements / constraints I've listed in the first post.
It may indeed be a bad idea long term. But let's make the most of it for now.
Makes sense? Am I being unreasonable here? --DD

Re: How to store "blobs" efficiently for small and large sizes, with random access

From
esconsult1@gmail.com
Date:
We had the same thought of storing the blobs inside LOs as well, many years ago.

But we ultimately chose cloud storage and stored a pointer in the database instead.

Now that we are approaching a terabyte of just normal data I don’t regret this decision one bit. Just handling backups and storage is already a chore. 

Data in S3 compatible storage is very easy to protect in numerous ways.

We have one set of code responsible for uploading, downloading and deleting the files themselves.

One downside? Occasionally an S3 delete fails and now and again a file or two gets orphaned. But we’ve never not found a file pointed to from our attachments table in 11 years.

We also only store pathnames/base names so we can easily move storage providers if we decide to go on-prem.

There is absolutely no upside to storing files in the db if you anticipate any kind of growth or significant volume.

Ericson Smith
CTO
Travel Agency Tribes

Sent from my iPhone

On 19 Oct 2022, at 7:01 PM, Dominique Devienne <ddevienne@gmail.com> wrote:


[Dominique's previous message quoted in full; trimmed.]

Re: How to store "blobs" efficiently for small and large sizes, with random access

From
"Daniel Verite"
Date:
    Dominique Devienne wrote:

> the fact the lo table is unique for the whole database would allow
> users to see blobs from any schema, as I understand it.

Direct access to pg_largeobject is only possible for superusers.
If lo_compat_privileges is on, any user can read any large
object with the lo* functions.
If it's off, they can read a large object only if they're the owner
or they have been granted permissions with

GRANT { { SELECT | UPDATE } [, ...] | ALL [ PRIVILEGES ] }
    ON LARGE OBJECT loid [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]
    [ GRANTED BY role_specification ]

Each large object has its own set of permissions. This is a significant
difference from bytea, since every creation of a new large object
may need to be followed by GRANT statements.
Also, if the roles and the access policies change during the
lifetime of the app, that might imply massive REVOKE/GRANT
statements applied to existing objects.
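For instance, a minimal sketch of that per-object GRANT chore (the
proj1_readers role is hypothetical):

DO $$
DECLARE
  loid oid;
BEGIN
  loid := lo_create(0);   -- new, empty large object; server picks the OID
  -- GRANT wants a literal OID, hence the dynamic SQL:
  EXECUTE format('GRANT SELECT ON LARGE OBJECT %s TO proj1_readers', loid);
END
$$;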


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Alvaro Herrera
Date:
On 2022-Oct-19, Dominique Devienne wrote:

> Upfront, I have to state that I'm not keen on lo, because of security
> considerations. We store blobs in many different schemas, and users
> can access some schemas, and not others. So the fact the lo table is
> unique for the whole database would allow users to see blobs from any
> schema, as I understand it. Right? OTOH, lo has random access, which I
> also need...

Generally speaking, bytea sucks for random access, because if a TOAST
item is compressed, it always has to be read from the beginning in order
to decompress correctly.  However, if you set
ALTER TABLE tab ALTER COLUMN byt SET STORAGE EXTERNAL
then compression is not used, and random access becomes fast.

https://www.postgresql.org/docs/15/sql-altertable.html
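For example, a minimal sketch (hypothetical table and column names)
combining EXTERNAL storage with substring() for random access:

CREATE TABLE blobs (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    data bytea
);
ALTER TABLE blobs ALTER COLUMN data SET STORAGE EXTERNAL;

-- Read 64 kB starting at byte offset 100,000,000
-- (substring offsets are 1-based):
SELECT substring(data FROM 100000001 FOR 65536)
FROM blobs WHERE id = 42;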

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Dominique Devienne
Date:
On Wed, Oct 19, 2022 at 3:01 PM Daniel Verite <daniel@manitou-mail.org> wrote:
>         Dominique Devienne wrote:
> > the fact the lo table is unique for the whole database would allow
> > users to see blobs from any schema, as I understand it.

> Each large object has its own set of permissions. This is a significant
> difference with bytea, since every creation of a new large object
> may need to be followed by GRANT statements.
> Also if the roles and the access policies are changed in the
> lifetime of the app, that might imply massive REVOKE/GRANT
> statements to apply to existing objects.

Thank you Daniel. Very interesting, and something I definitely didn't know.

I believe that's doable, given our design of ROLEs, but it would for sure
be both a PITA and additional management / code to deal with. At least
GRANTs are transactional, like the new LO oids themselves, I think, so now I
know it would be possible to properly secure the LOs.

This insight is greatly appreciated. --DD



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Dominique Devienne
Date:
On Wed, Oct 19, 2022 at 3:05 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Oct-19, Dominique Devienne wrote:
> > OTOH, lo has random access, which I also need...
>
> Generally speaking, bytea sucks for random access, because if a TOAST
> item is compressed, it has to be always read from the beginning in order
> to decompress correctly.  However, if you set
> ALTER TABLE tab ALTER COLUMN byt SET STORAGE EXTERNAL
> then compression is not used, and random access becomes fast.

Thank you Álvaro. Quite insightful as well. I was aware of TOAST compression,
but didn't make the connection to the side-effects on random-access.

But now that TOAST has LZ4 support, which decompresses extremely fast
compared to ZLib (I have experience with LZ4 for WebSocket-based server
messages), choosing an appropriately small shard/chunk size might mitigate
that somewhat.
Would need testing / benchmarking to compare uncompressed vs LZ4, at various
chunk and subset/offset sizes, of course.
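(For reference, switching a bytea column's TOAST compression to LZ4 is a
one-liner on PG 14+; the table/column names below are hypothetical:)

ALTER TABLE blobs ALTER COLUMN data SET COMPRESSION lz4;
-- Newly stored values use LZ4; existing values keep whatever compression
-- they were written with until rewritten.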

Does anybody have an answer to my question regarding how substr() works on
bytea values? I.e. is it "pushed down" / optimized enough that it avoids
reading the whole N-byte value, only to then pass it to substr(), which then
returns an M-byte value (where M < N)?

If TOAST stores ~2,000-byte chunks, and those chunks' PKs are numbers,
simple arithmetic should be able to select only the chunks of interest, thus
incurring only the necessary IO for the selected range, no?

Or does the fact that subsetting a bytea currently requires substr() prevent
such a scenario? And if so, why not support a native subsetting notation
that does support it, like the obvious bytea_col[offset, count] or
bytea_col[start:end]?

It seems to me that efficient native subsetting of varlength values would be
quite valuable.



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
"Daniel Verite"
Date:
    Dominique Devienne wrote:

> PostgreSQL bytea is much better and simpler, except limited to 1GB...
> Bytea also has no direct random access, except via substr[ing], but
> how efficient and "random access" is that?

Bytea contents are compressed before being sliced (in chunks of
TOAST_MAX_CHUNK_SIZE bytes, typically it's 2000 IIRC), so it's not
possible to access a piece of data without decompressing the contents
before it.

By contrast large objects are sliced before compression, so the
performance of random access is likely to be completely different.
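(For illustration, server-side random access on a large object looks like
this; the OID, offset and length are made-up values:)

-- Read 64 kB at byte offset 100,000,000 from large object 16404:
SELECT lo_get(16404::oid, 100000000, 65536);
-- The client-side equivalent is lo_open() / lo_lseek64() / lo_read().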

> Here are the main requirement I need to fulfil:
> 1) store literally millions of rows, 1 "blob" per row. (scientific data).
> 2) most "blobs" are smallish, below a few KBs. bytea's perfect for that.
> 3) yet many blobs are on the dozens of MBs. bytea's still OK.
> 4) but some blobs exceed the 1GB byte limit. Found at least a dozen
> just in our test data, so clients will have them.
> 5) for accessing larger blobs, the API I must implement accesses
> contiguous chunks of the blobs. Thus I need random access.

In your case I would probably opt for bytea (as opposed to large
objects), and slicing the blobs in the application in chunks of a
fixed size much larger than what TOAST does (for instance, somewhere
between 128 kB and 32 MB).

That is, having a table like:

create table blobs (
  blob_id some_type,
  chunk_no int,  /* 0->N */
  chunk bytea
);

It's not as simple as using a single bytea field or large objects,
but overall it avoids the management difficulties of both large
objects and very large contents in bytea columns.
Random access is achieved by skipping the chunks before the
requested piece of data.

If the app is able to read/write the chunks in binary mode, its queries
should perform as well as the large object functions.
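A sketch of the corresponding range read, assuming a fixed 1 MB
(1048576-byte) chunk size; the blob id (42), offset (100000000) and length
(1000000) are made-up values, and the app trims the head of the first chunk
and the tail of the last one:

SELECT chunk_no, chunk
FROM blobs
WHERE blob_id = 42
  AND chunk_no BETWEEN 100000000 / 1048576                  -- first chunk hit
                   AND (100000000 + 1000000 - 1) / 1048576  -- last chunk hit
ORDER BY chunk_no;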


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



On 10/19/22 04:47, Dominique Devienne wrote:
[snip]
> PS: Another drawback of lo is that because it's a single table, it
> still subject to the 32 TB limit on a relation.
>    The 4TB limit per lo is way more than we need, but the 32 TB limit
> may actually be more of an issue for
>    our larger clients, which have thousands of projects, each with
> upwards of a few millions of those blobs.

Partition the table on the first segment of the Primary Key.  From 
experience, anything else can lead to serious query degradation.
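A hedged sketch of what that could look like for a chunked-blob table
(names, key type and partition count are all hypothetical):

CREATE TABLE blobs (
    blob_id  uuid,
    chunk_no int,
    chunk    bytea,
    PRIMARY KEY (blob_id, chunk_no)     -- partition key must be part of the PK
) PARTITION BY HASH (blob_id);

CREATE TABLE blobs_p0 PARTITION OF blobs FOR VALUES WITH (MODULUS 8, REMAINDER 0);
-- ... and so on for remainders 1 through 7.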

-- 
Angular momentum makes the world go 'round.



On 10/19/22 06:38, Andreas Joseph Krogh wrote:
[Earlier messages quoted; trimmed.]
I'm not saying you don't need backup (or redundancy) of other systems holding blobs, but moving them out of RDBMS makes you restore the DB to a consistent state, and able to serve clients, faster. In my experience It's quite unlikely that your (redundant) blob-store needs crash-recovery at the same time you DB does. The same goes with PITR, needed because of some logical error (like client deleted some data they shouldn't have), which is much faster without blobs in DB and doesn't affect the blobstore at all (if you have a smart insert/update/delete-policy there).


This is nothing to sneeze at. Backing up a 30TB database takes a long time.

 

There's a reason “everybody” advises moving blobs out of the DB, I've learned.


We deal with an ISV maintaining a banking application.  It stores scanned images of checks as bytea fields in a PostgreSQL 9.6 database.  The next version will store the images outside of the database.

 


--
Angular momentum makes the world go 'round.
On Wed, 2022-10-19 at 12:48 +0200, Dominique Devienne wrote:
> On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> > First advice, don't do it. We started off storing blobs in DB for “TX safety”
> 
> Not really an option, I'm afraid.

You should reconsider.  Ruling out that option now might get you into trouble
later.  Large Objects mean trouble.

Yours,
Laurenz Albe



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Alvaro Herrera
Date:
On 2022-Oct-19, Dominique Devienne wrote:

> Anybody has an answer to my question regarding how substr() works on
> bytea values?  I.e. is it "pushed down" / optimized enough that it
> avoids reading the whole N-byte value, to then pass it to substr(),
> which then returns an M-byte value (where M < N)?
>
> If TOAST stores 2,000 chunks, and those chunks' PKs are numbers,
> simple arithmetic should be able to select only the chunks of
> interest, those incurring only the necessary IO for the selected
> range, no?

That's exactly what I was trying to say.  If there's no compression, we
don't read prior chunks.  (This is valid for bytea, at least; for
textual types we have to worry about multibyte characters, which are
again a potential source of confusion regarding the exact location you
want to seek.)

This can be seen in detoast_attr_slice() in
src/backend/access/common/detoast.c, though there are way too many^W^W^W
multiple layers of indirection if you start from bytea_substr() in
varlena.c.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Dominique Devienne
Date:
On Wed, Oct 19, 2022 at 4:29 PM Daniel Verite <daniel@manitou-mail.org> wrote:
> In your case I would probably opt for bytea (as opposed to large
> objects), and slicing the blobs in the application in chunks

Thanks for the advice, and the valuable info on LO permissions. --DD



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Dominique Devienne
Date:
On Wed, Oct 19, 2022 at 5:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> That's exactly what I was trying to say.
> If there's no compression, we don't read prior chunks.

Great to read that. I'll probably try to benchmark w/ and w/o
compression eventually.
Need to deal with other issues first, will take a while to report back on this.

> This can be seen in detoast_attr_slice()

Thank you for the pointer. I'll be sure to have a look at that code.
And thanks again for chiming in, with very useful info Alvaro.



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Dominique Devienne
Date:
On Wed, Oct 19, 2022 at 5:05 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Wed, 2022-10-19 at 12:48 +0200, Dominique Devienne wrote:
> > On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> > > First advice, don't do it. We started off storing blobs in DB for “TX safety”
> > Not really an option, I'm afraid.
> You should reconsider.  Ruling out that option now might get you into trouble
> later.  Large Objects mean trouble.

Andreas, Ericson, Laurenz, thanks for the advice.
I'll be sure to discuss these concerns with the team.

We have other (bigger) data in the file system already, albeit of a more
read-only nature, perhaps.
And I'm not familiar with how security is handled in that area, so I'll
investigate whether a path forward to externalize the largish blobs
(currently destined to live in the DB) is possible.
So I hope you can see I'm not dismissing what you guys are saying.

But before I finish this thread for now, I'd like to add that I consider it
unfortunate that NOT putting the data in the DB is the mostly agreed-upon
advice. IMHO it points to a weak point of PostgreSQL, which does not invest
in those large-data use cases, perhaps with more file-system-like
techniques. Probably because most of the large users of PostgreSQL are more
on the "business" side (numerous rows, but of smaller sizes) than on the
"scientific" side, which (too often) uses files and files-in-a-file formats
like HDF5.

FWIW, when Oracle introduced SecureFile blobs years ago in v11, they
represented a leap forward in performance, and back then we were seeing them
being 3x faster than LO at GB sizes, if I recall correctly, with throughput
that challenged regular networked file systems like NFS. That was over 10
years ago, so who knows where we are now. And from the posts here, the
issues with large blobs may be related more to backup/restore than to
runtime performance.

Having all the data in the DB, under a single security model, is a big win
for consistency and simplicity. And the fact that it's not really possible
now is a pity, in my mind. My (probably uninformed) opinion on this is that
large blobs are handled just like other relational data, in paged storage
designed for smaller data.
I.e. file-like blobs are shoehorned into structures which are inappropriate
for them, and a rethink and redesign is necessary specifically for them,
similar to the Oracle SecureFile one of old.

I have similar gripes with SQLite, which is otherwise a fantastic embedded
DB. Just see how the SQLite-based Fossil SCM fails to scale for very large
repos with big (e.g. game) assets, and how a DB-backed store similarly
failed to scale in SVN a long time ago, to be replaced by a forest-of-files
approach (which Git also uses).

DBs like PostgreSQL and SQLite should be better at this. And I hope they
get there eventually.
Sorry to turn a bit philosophical here. It's not a critique per se, more the
personal musings of a dev who's been in this space for a long time.
FWIW. Thanks, --DD



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Andreas Joseph Krogh
Date:
On Thursday 20 October 2022 at 10:32:44, Dominique Devienne <ddevienne@gmail.com> wrote:
On Wed, Oct 19, 2022 at 5:05 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Wed, 2022-10-19 at 12:48 +0200, Dominique Devienne wrote:
> > On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
> > > First advice, don't do it. We started off storing blobs in DB for “TX safety”
> > Not really an option, I'm afraid.
> You should reconsider.  Ruling out that option now might get you into trouble
> later.  Large Objects mean trouble.

Andreas, Ericson, Laurenz, thanks for the advice.
I'll be sure to discuss these concerns with the team.
[…]
But before I finish this thread for now, I'd like to add that I
consider unfortunate a state of affairs where
NOT putting the data in the DB is the mostly agreed upon advice. It
IMHO points to a weak point of
PostgreSQL, which does not invest in those use-cases with large data,
perhaps with more file-system
like techniques. Probably because most of the large users of
PostgreSQL are more on the "business"
side (numerous data, but on the smaller sizes) than the "scientific"
side, which (too often) uses files and
files-in-a-file formats like HDF5.
[…]

 

Note that my views are not PG-specific and apply to all applications/architectures involving an RDBMS.
From my point of view, having all data in the RDBMS is (maybe) theoretically sound, but given that IO is not instant I consider it a design flaw, for the reasons I've already pointed out.

 

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
On Thu, 2022-10-20 at 10:32 +0200, Dominique Devienne wrote:
> But before I finish this thread for now, I'd like to add that I
> consider unfortunate a state of affairs where
> NOT putting the data in the DB is the mostly agreed upon advice. It
> IMHO points to a weak point of
> PostgreSQL, which does not invest in those use-cases with large data,
> perhaps with more file-system
> like techniques.

Relational databases are general-purpose software, but that doesn't
mean that they are the right tool for everything.  And storing large
files is not what they are good at.  File systems are the right tool
here.

> FWIW, when Oracle introduced SecureFile blobs years ago in v11, it
> represented a leap forward in
> performance, and back then we were seeing them being 3x faster than LO
> at GB sizes, if I recall correctly,

I don't know what exactly they are, but I suspect that they are just
files (segments?) in Oracle's "file system" (tablespaces/datafiles).
So pretty much what we recommend.

Yours,
Laurenz Albe




> I don't know what exactly they are, but I suspect that they are just
> files (segments?) in Oracle's "file system" (tablespaces/datafiles).
> So pretty much what we recommend.

Maybe so, but if those large segments were presented "seamlessly"
in the form of a table integrated with PG's access/security infrastructure,
that would be really helpful for some scenarios.

A view-on-top-of-file_fdw kind of thing?

LO seems to nearly be there by now, or am I misunderstanding?
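(Purely as an illustration of the idea, not a recommendation: something
vaguely similar can be hand-rolled today with pg_read_binary_file(), which
is superuser / pg_read_server_files territory; the mapping table and paths
below are hypothetical.)

CREATE VIEW project_blobs AS
SELECT f.blob_id,
       pg_read_binary_file(f.path, 0, 1048576) AS first_mb   -- offset, length
FROM blob_files f;   -- hypothetical table mapping blob ids to server-side paths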

Karsten




On 10/20/22 03:32, Dominique Devienne wrote:
[snip]
> And from the posts here, the issues
> with large blobs may be more
> related to backup/restore perhaps, than runtime performance.

From my long experience as first a programmer and then a database
administrator, backups and archiving are at the very bottom of programmers'
priority list, whereas they're near the top of a DBA's priority list.

> Having all the data in the DB, under a single security model, is a big
> win for consistency and simplicity.

No doubt.

-- 
Angular momentum makes the world go 'round.



Re: How to store "blobs" efficiently for small and large sizes, with random access

From
Dominique Devienne
Date:
On Thu, Oct 20, 2022 at 12:21 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Thu, 2022-10-20 at 10:32 +0200, Dominique Devienne wrote:
> > FWIW, when Oracle introduced SecureFile blobs years ago in v11, it
> > represented a leap forward in
> > performance, and back then we were seeing them being 3x faster than LO
> > at GB sizes, if I recall correctly,
>
> I don't know what exactly they are, but I suspect that they are just
> files (segments?) in Oracle's "file system" (tablespaces/datafiles).
> So pretty much what we recommend.

I don't think so :) Pretty much the opposite, in fact.

Don't let the term SecureFile fool you. From the DB client's perspective,
they are a server-side value like any other, living in a row/col
(relation/tuple), pretty much like a BYTEA value. But unlike BYTEA, and like
LO, what you SELECT or UPDATE is a LOB Locator, not the content itself,
which is then used in separate APIs, so very much like LO. So you get the
best of both BYTEA (acts like a value, in a tab/col) and LO (random access).

Fully transactional. In fact, the LOB locator is a handle into the MVCC
machinery, so you can get the locator(s) within a transaction, commit that
transaction, and later, if you read from the locator, you are still AS OF
that transaction (i.e. read-consistent).
That's super handy for lazy-loading the data in the app on demand. You of
course expose yourself to "snapshot too old". We missed that dearly in
PostgreSQL. One could emulate it with an explicit SNAPSHOT left open (see
the sketch below), but that's not nearly as convenient.
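(I mean something along these lines, with an exported snapshot; the table
name and the snapshot id are made up:)

-- Session A: keep a transaction open and export its snapshot.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- e.g. returns '00000004-0000006A-1'

-- Session B: read "as of" that snapshot; must be set before any query.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000004-0000006A-1';
SELECT substring(data FROM 1 FOR 65536) FROM blobs WHERE id = 42;
COMMIT;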

And there are specialized APIs that allow operating on *multiple* LOBs in a
single server round-trip, which was essential for performance with smaller
ones. In Oracle, anything above 32KB had to be a LOB (at the time), yet you
don't want to do a round-trip for each and every 32KB chunk of data when you
have thousands to load (unless you shard yourself "manually", but good luck
matching the perf of SecureFile LOBs).

I'm not privy to how they were implemented internally. But I do believe the
old blobs they were replacing (at the time) were more like LO is, i.e.
handled (mostly) like the other datatypes, in the table infrastructure;
while these new (in v11) blobs were handled internally completely
differently, more in a file-system manner.
Thus the large performance gap between the OLD and NEW Oracle LOBs.

But from the outside, that's an "implementation detail". They were in the DB,
transactional, value-like (modulo the level of indirection for random access),
and importantly, efficient. I really wish PostgreSQL had an equivalent.

There's apparently an Oracle ACE on this list, so you can fill in the gaps,
or correct any falsehoods, in what I wrote above. This is over 10 years old,
so.... I was a big fan of another ACE, Tom Kyte, whose books helped me a
lot, and I was neck-deep in OCI for a few years, but I was just a
mostly-self-taught Oracle developer, so definitely not an expert like an
ACE. FWIW, --DD