Thread: Storing large files in multiple schemas: BLOB or BYTEA
Hi,
I need to store large files (from several MB to 1GB) in a Postgres database. The database has multiple schemas. It looks like Postgres has two options for storing large objects: LOB and BYTEA. However, we seem to hit problems with each of these options.
1. LOB. This works almost ideally: it can store up to 2GB and allows streaming, so we do not hit memory limits in our PHP backend when reading the LOB. However, all blobs are stored in pg_catalog and are not part of any schema. This leads to a big problem when you try to use pg_dump with the -n and -b options to dump just one schema with its blobs: it dumps the schema data correctly, but it then includes ALL blobs in the database, not just the blobs that belong to that particular schema.
Is there a way to dump a single schema with its blobs using pg_dump or some other utility?
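(For reference, the invocation we are using is essentially `pg_dump -b -n tenant1 mydb`, where tenant1 and mydb are placeholder names for one tenant schema and our database; the schema objects come out fine, but the blob section contains every large object in the database.)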
2. BYTEA. These are correctly stored per schema, so pg_dump -n works correctly. However, I cannot seem to find a way to stream the data, which means there is no way to access it from PHP if it is larger than the memory limit.
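(For context, this is roughly how we stream a large object from PHP today without exhausting memory; the connection string, the chunk size, and the source of $loid are placeholders. With a bytea column the whole value arrives in a single result, so I see no equivalent.)

<?php
// Sketch: stream one large object in 256 KB chunks (connection string,
// chunk size and the source of $loid are placeholders).
$conn = pg_connect('dbname=mydb');
$loid = 16409;                      // normally looked up from our own metadata table

pg_query($conn, 'BEGIN');           // large-object handles are only valid inside a transaction
$lo = pg_lo_open($conn, $loid, 'r');

while (($chunk = pg_lo_read($lo, 262144)) !== false && $chunk !== '') {
    echo $chunk;                    // or fwrite() to a file / output stream
}

pg_lo_close($lo);
pg_query($conn, 'COMMIT');
?>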
Is there any other way to store large data in Postgres that allows streaming and works correctly with multiple schemas per database?
Thanks.
(Sorry if this double-posts on pgsql-php, I did not know which is the best list for this question).
On 10/10/2012 05:16 AM, tigran2-postgres@riatest.com wrote:
> I need to store large files (from several MB to 1GB) in Postgres
> database. The database has multiple schemas. It looks like Postgres
> has 2 options to store large objects: LOB and BYTEA. However we seem
> to hit problems with each of these options.

I believe the general consensus around here is to not do that, if you can avoid it. File systems are much better equipped to handle files of that magnitude, especially when it comes to retrieving them, scanning through their contents, or really, any access pattern aside from simple storage.

You're better off storing the blob on disk somewhere and storing a row that refers to its location: either key pieces for a naming scheme, or the full path. This is especially true if you mean to later access that data with PHP.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com
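P.S. A minimal sketch of what I mean, in PHP since that's your stack; the files table, base directory and column names are entirely made up for illustration:

<?php
// Record a row that points at the file's on-disk location, then write the
// file under a path derived from the row id (all names are hypothetical).
$base = '/srv/blobstore';
$conn = pg_connect('dbname=mydb');

$res = pg_query_params($conn,
    'INSERT INTO files (tenant, orig_name) VALUES ($1, $2) RETURNING id',
    array($tenant, $origName));
$id  = pg_fetch_result($res, 0, 'id');

// Shard on the id so no single directory grows without bound.
$path = sprintf('%s/%03d/%d.bin', $base, $id % 1000, $id);
@mkdir(dirname($path), 0770, true);

if (move_uploaded_file($_FILES['upload']['tmp_name'], $path)) {
    pg_query_params($conn, 'UPDATE files SET store_path = $1 WHERE id = $2',
        array($path, $id));
}
?>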
tigran2-postgres@riatest.com wrote:
> Is there any other way to store large data in Postgres that allows streaming and correctly works with
> multiple schemas per database?

Large Objects and bytea are the only ways.

If you want to pg_dump only certain large objects, that won't work as far as I know (maybe using permissions and a non-superuser can help).

Do you absolutely need to pg_dump parts of the database regularly?

Yours,
Laurenz Albe
>Large Objects and bytea are the only ways.
>
>If you want to pg_dump only certain large objects, that won't work as far as I know (maybe using permissions and a non-superuser can help).
>
>Do you absolutely need to pg_dump parts of the database regularly?
>
>Yours,
>Laurenz Albe
It is not an absolute requirement, but it would be really nice to have. We have a multi-tenant database, with each tenant's data stored in a separate schema. Using pg_dump seems to be the ideal way to migrate tenant data from one database to another when we need to rebalance the load.
>I believe the general consensus around here is to not do that, if you can avoid it. File systems are much better equipped to handle files of that magnitude, especially when it comes to retrieving them, scanning >through their contents, or really, any access pattern aside from simple storage.
>
>You're better off storing the blob on disk somewhere and storing a row that refers to its location. Either key pieces for a naming scheme or the full path.
>
>This is especially true if you mean to later access that data with PHP.
>
>--
>Shaun Thomas
Using files stored outside the database creates all sorts of problems. For starters, you lose ACID guarantees. I would prefer to keep them in the database. We did a lot of experiments with Large Objects and they really worked fine (we stored hundreds of LOBs ranging from a few MB up to 1GB). Postgres does a really good job with Large Objects. If it were not for the pg_dump problem, I would not hesitate to use LOBs.
On 10/11/2012 01:35 PM, tigran2-postgres@riatest.com wrote:
> Using files stored outside the database creates all sorts of problems.
> For starters, you lose ACID guarantees. I would prefer to keep them in
> the database. We did a lot of experiments with Large Objects and they
> really worked fine (we stored hundreds of LOBs ranging from a few MB
> up to 1GB). Postgres does a really good job with Large Objects. If it
> were not for the pg_dump problem, I would not hesitate to use LOBs.

Yeah, a pg_dump mode that dumped everything but large objects would be nice.
Right now I find storing large objects in the DB such a pain from a backup management point of view that I avoid it where possible.
I'm now wondering about the idea of implementing a pg_dump option that dumped large objects into a directory tree like
lobs/[loid]/[lob_md5]
and wrote out a restore script that loaded them using `lo_import`.
During dumping, temporary copies could be written to something like lobs/[loid]/.tmp.nnnn, with the md5 being calculated on the fly as the byte stream is read. If the dumped file had the same md5 as the existing one, the tempfile would just be deleted; otherwise the tempfile would be renamed to the calculated md5.
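Roughly what I have in mind on the dump side, sketched as a standalone PHP script rather than an actual pg_dump change (the chunk size, the .tmp suffix and the output file names are just illustrative):

<?php
// Dump every large object to lobs/<loid>/<md5>, computing the md5 while the
// bytes are streamed, and emit a restore script based on lo_import().
$conn = pg_connect('dbname=mydb');
pg_query($conn, 'BEGIN');

$res     = pg_query($conn, 'SELECT oid FROM pg_largeobject_metadata ORDER BY oid');
$restore = fopen('restore_lobs.sql', 'w');

while ($row = pg_fetch_assoc($res)) {
    $loid = $row['oid'];
    $dir  = "lobs/$loid";
    @mkdir($dir, 0777, true);

    $tmp  = "$dir/.tmp." . getmypid();
    $out  = fopen($tmp, 'w');
    $hash = hash_init('md5');

    $lo = pg_lo_open($conn, (int)$loid, 'r');
    while (($chunk = pg_lo_read($lo, 262144)) !== false && $chunk !== '') {
        hash_update($hash, $chunk);
        fwrite($out, $chunk);
    }
    pg_lo_close($lo);
    fclose($out);

    $md5 = hash_final($hash);
    if (file_exists("$dir/$md5")) {
        unlink($tmp);                   // unchanged since the last dump
    } else {
        rename($tmp, "$dir/$md5");
    }
    fwrite($restore, "SELECT lo_import('$dir/$md5', $loid);\n");
}

fclose($restore);
pg_query($conn, 'COMMIT');
?>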
That way incremental backup systems could manage the dumped LOB tree without quite the same horrible degree of duplication as is currently faced when using lo in the database with pg_dump.
A last_modified timestamp on `pg_largeobject_metadata` would be even better, allowing the cost of reading and discarding rarely-changed large objects to be avoided.

--
Craig Ringer
SQL DATALINK (Re: Storing large files in multiple schemas: BLOB or BYTEA)
> > I need to store large files (from several MB to 1GB) in Postgres
> > database. The database has multiple schemas. It looks like Postgres
> > has 2 options to store large objects: LOB and BYTEA. However we seem
> > to hit problems with each of these options.
>
> I believe the general consensus around here is to not do that, if you
> can avoid it. File systems are much better equipped to handle files
> of that magnitude, especially when it comes to retrieving them,
> scanning through their contents, or really, any access pattern aside
> from simple storage.
>
> You're better off storing the blob on disk somewhere and storing a
> row that refers to its location. Either key pieces for a naming
> scheme or the full path.

What's the state of implementation (or planning) for this?

http://wiki.postgresql.org/wiki/DATALINK

Sincerely,
Wolfgang