Thread: Storing large files in multiple schemas: BLOB or BYTEA
Hi,
I need to store large files (from several MB to 1GB) in a Postgres database. The database has multiple schemas. It looks like Postgres has two options for storing large objects: LOB and BYTEA. However, we seem to hit problems with each of these options.
1. LOB. This works almost ideally: it can store up to 2GB and allows streaming, so we do not hit memory limits in our PHP backend when reading the LOB. However, all blobs are stored in pg_catalog and are not part of any schema. This leads to a big problem when you try to use pg_dump with the -n and -b options to dump just one schema with its blobs: it dumps the schema data correctly, but it then includes ALL blobs in the database, not just the blobs that belong to that particular schema.
Is there a way to dump a single schema with its blobs using pg_dump or some other utility?
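(For reference, the invocation we are using is essentially `pg_dump -b -n tenant1 mydb`, where tenant1 and mydb are placeholder names for one tenant schema and our database; the schema objects come out fine, but the blob section contains every large object in the database.)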
2. BYTEA. These are correctly stored per schema, so pg_dump -n works correctly. However, I cannot seem to find a way to stream the data, which means there is no way to access it from PHP if it is larger than the memory limit.
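(For context, this is roughly how we stream a large object from PHP today without exhausting memory; the connection string, the chunk size, and the source of $loid are placeholders. With a bytea column the whole value arrives in a single result, so I see no equivalent.)

<?php
// Sketch: stream one large object in 256 KB chunks (connection string,
// chunk size and the source of $loid are placeholders).
$conn = pg_connect('dbname=mydb');
$loid = 16409;                      // normally looked up from our own metadata table

pg_query($conn, 'BEGIN');           // large-object handles are only valid inside a transaction
$lo = pg_lo_open($conn, $loid, 'r');

while (($chunk = pg_lo_read($lo, 262144)) !== false && $chunk !== '') {
    echo $chunk;                    // or fwrite() to a file / output stream
}

pg_lo_close($lo);
pg_query($conn, 'COMMIT');
?>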
Is there any other way to store large data in Postgres that allows streaming and works correctly with multiple schemas per database?
Thanks.
(Sorry if this double-posts on pgsql-php, I did not know which is the best list for this question).
On 10/10/2012 05:16 AM, tigran2-postgres@riatest.com wrote:
> I need to store large files (from several MB to 1GB) in Postgres
> database. The database has multiple schemas. It looks like Postgres
> has 2 options to store large objects: LOB and BYTEA. However we seem
> to hit problems with each of these options.

I believe the general consensus around here is to not do that, if you can avoid it. File systems are much better equipped to handle files of that magnitude, especially when it comes to retrieving them, scanning through their contents, or really, any access pattern aside from simple storage.

You're better off storing the blob on disk somewhere and storing a row that refers to its location: either key pieces for a naming scheme, or the full path. This is especially true if you mean to later access that data with PHP.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com
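P.S. A minimal sketch of what I mean, in PHP since that's your stack; the files table, base directory and column names are entirely made up for illustration:

<?php
// Record a row that points at the file's on-disk location, then write the
// file under a path derived from the row id (all names are hypothetical).
$base = '/srv/blobstore';
$conn = pg_connect('dbname=mydb');

$res = pg_query_params($conn,
    'INSERT INTO files (tenant, orig_name) VALUES ($1, $2) RETURNING id',
    array($tenant, $origName));
$id  = pg_fetch_result($res, 0, 'id');

// Shard on the id so no single directory grows without bound.
$path = sprintf('%s/%03d/%d.bin', $base, $id % 1000, $id);
@mkdir(dirname($path), 0770, true);

if (move_uploaded_file($_FILES['upload']['tmp_name'], $path)) {
    pg_query_params($conn, 'UPDATE files SET store_path = $1 WHERE id = $2',
        array($path, $id));
}
?>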
tigran2-postgres@riatest.com wrote:
> Is there any other way to store large data in Postgres that allows streaming and correctly works with
> multiple schemas per database?

Large Objects and bytea are the only ways.

If you want to pg_dump only certain large objects, that won't work as far as I know (maybe using permissions and a non-superuser can help).

Do you absolutely need to pg_dump parts of the database regularly?

Yours,
Laurenz Albe
>Large Objects and bytea are the only ways.
>
>If you want to pg_dump only certain large objects, that won't work as far as I know (maybe using permissions and a non-superuser can help).
>
>Do you absolutely need to pg_dump parts of the database regularly?
>
>Yours,
>Laurenz Albe
It is not an absolute requirement, but it would be really nice to have. We have a multi-tenant database, with each tenant's data stored in a separate schema. Using pg_dump seems to be the ideal way to migrate tenant data from one database to another when we need to rebalance the load.
>I believe the general consensus around here is to not do that, if you can avoid it. File systems are much better equipped to handle files of that magnitude, especially when it comes to retrieving them, scanning >through their contents, or really, any access pattern aside from simple storage.
>
>You're better off storing the blob on disk somewhere and storing a row that refers to its location. Either key pieces for a naming scheme or the full path.
>
>This is especially true if you mean to later access that data with PHP.
>
>--
>Shaun Thomas
Using files stored outside the database creates all sorts of problems. For starters, you lose ACID guarantees. I would prefer to keep them in the database. We did a lot of experiments with Large Objects and they really worked fine (we stored hundreds of LOBs ranging from a few MB up to 1GB). Postgres does a really good job with Large Objects. If it were not for the pg_dump problem, I would not hesitate to use LOBs.
On 10/11/2012 01:35 PM, tigran2-postgres@riatest.com wrote:
> Using files stored outside the database creates all sorts of problems.
> For starters, you lose ACID guarantees. I would prefer to keep them in
> the database. We did a lot of experiments with Large Objects and they
> really worked fine (we stored hundreds of LOBs ranging from a few MB
> up to 1GB). Postgres does a really good job with Large Objects. If it
> were not for the pg_dump problem, I would not hesitate to use LOBs.

Yeah, a pg_dump mode that dumped everything but large objects would be nice.
Right now I find storing large objects in the DB such a pain from a backup management point of view that I avoid it where possible.
I'm now wondering about the idea of implementing a pg_dump option that dumped large objects into a directory tree like
lobs/[loid]/[lob_md5]
and wrote out a restore script that loaded them using `lo_import`.
During dumping, temporary copies could be written to something like lobs/[loid]/.tmp.nnnn, with the md5 being calculated on the fly as the byte stream is read. If the dumped file had the same md5 as the existing one, the tempfile would just be deleted; otherwise the tempfile would be renamed to the calculated md5.
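Roughly what I have in mind on the dump side, sketched as a standalone PHP script rather than an actual pg_dump change (the chunk size, the .tmp suffix and the output file names are just illustrative):

<?php
// Dump every large object to lobs/<loid>/<md5>, computing the md5 while the
// bytes are streamed, and emit a restore script based on lo_import().
$conn = pg_connect('dbname=mydb');
pg_query($conn, 'BEGIN');

$res     = pg_query($conn, 'SELECT oid FROM pg_largeobject_metadata ORDER BY oid');
$restore = fopen('restore_lobs.sql', 'w');

while ($row = pg_fetch_assoc($res)) {
    $loid = $row['oid'];
    $dir  = "lobs/$loid";
    @mkdir($dir, 0777, true);

    $tmp  = "$dir/.tmp." . getmypid();
    $out  = fopen($tmp, 'w');
    $hash = hash_init('md5');

    $lo = pg_lo_open($conn, (int)$loid, 'r');
    while (($chunk = pg_lo_read($lo, 262144)) !== false && $chunk !== '') {
        hash_update($hash, $chunk);
        fwrite($out, $chunk);
    }
    pg_lo_close($lo);
    fclose($out);

    $md5 = hash_final($hash);
    if (file_exists("$dir/$md5")) {
        unlink($tmp);                   // unchanged since the last dump
    } else {
        rename($tmp, "$dir/$md5");
    }
    fwrite($restore, "SELECT lo_import('$dir/$md5', $loid);\n");
}

fclose($restore);
pg_query($conn, 'COMMIT');
?>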
That way incremental backup systems could manage the dumped LOB tree without quite the same horrible degree of duplication as is currently faced when using lo in the database with pg_dump.
A last_modified timestamp on `pg_largeobject_metadata` would be even better, allowing the cost of reading and discarding rarely-changed large objects to be avoided.

--
Craig Ringer
SQL DATALINK (Re: Storing large files in multiple schemas: BLOB or BYTEA)
> > I need to store large files (from several MB to 1GB) in Postgres
> > database. The database has multiple schemas. It looks like Postgres
> > has 2 options to store large objects: LOB and BYTEA. However we seem
> > to hit problems with each of these options.
>
> I believe the general consensus around here is to not do that, if you
> can avoid it. File systems are much better equipped to handle files
> of that magnitude, especially when it comes to retrieving them,
> scanning through their contents, or really, any access pattern aside
> from simple storage.
>
> You're better off storing the blob on disk somewhere and storing a
> row that refers to its location. Either key pieces for a naming
> scheme or the full path.

What's the state of implementation (or planning) for this?

http://wiki.postgresql.org/wiki/DATALINK

Sincerely,
Wolfgang