Thread: PDF files: to store in database or not
With no experience of storing binary data in a bytea column I don't know when its use is appropriate. I suspect that for an application I'm developing it would be better to store row-related documents outside the database, and want to learn if that is the appropriate approach. Consider an application that manages a fleet of vehicles. There's a Vehicles table with information on each one (perhaps make, model, VIN, year of purchase) and a Services table. There are many PDF documents associated with each row in the tables: purchase contract, insurance form, service and maintenance records, etc. My thinking is to not store these documents in the database, but to store them in subdirectories outside the database. Your thoughts? Rich
On 12/06/2016 10:30 AM, Rich Shepard wrote: > My thinking is to not store these documents in the database, but to store > them in subdirectories outside the database. > > Your thoughts? Due to the widely variable size of a PDF document, I would say no. I would store the metadata and file location. Sincerely, JD > > Rich > > > -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them. Unless otherwise stated, opinions are my own.
On Tue, 6 Dec 2016, Joshua D. Drake wrote: > Due to the widely variable size of a PDF document, I would say no. I would > store the metadata and file location. Joshua, I read your answer as "don't store them in the database, but store the location in a column." Thanks for confirming, Rich
... but what if database is replicated? Thanks Moreno. Il 06/12/2016 19:50, Rich Shepard ha scritto: > On Tue, 6 Dec 2016, Joshua D. Drake wrote: > >> Due to the widely variable size of a PDF document, I would say no. I >> would >> store the metadata and file location. > > Joshua, > > I read your answer as "don't store them in the database, but store the > location in a column." > > Thanks for confirming, > > Rich > >
On 12/06/2016 01:30 PM, Rich Shepard wrote: > With no experience of storing binary data in a bytea column I don't > know > when its use is appropriate. I suspect that for an application I'm > developing it would be better to store row-related documents outside the > database, and want to learn if that is the appropriate approach. > > Consider an application that manages a fleet of vehicles. There's a > Vehicles table with information on each one (perhaps make, model, VIN, > year > of purchase) and a Services table. There are many PDF documents > associated > with each row in the tables: purchase contract, insurance form, > service and > maintenance records, etc. > > My thinking is to not store these documents in the database, but to > store > them in subdirectories outside the database. > > Your thoughts? > > Rich I'd also be interested in answers to this that are give specific pros and cons, and not in terms of "its better to do this than that." What's "better" depends on how much you value the various pros and the cons. One of the pros of keeping them in the database is ease of protecting adds and updates to the files and their related data with a transaction and being able to have system where it iss pretty much impossible for the documents to ever be out of sync with the related data. I maintain some systems that do keep the documents outside of the database, and the application code maintains the transactional integrity of the files and data, and for the most part we don't have integrity problems. In the worst of an add or update operation being interrupted by a system crash or unexpected error, we have a new document saved but the data about this document has not been written to the database and it is as if that operation never happened. The file may really be there but the system does not "know about it." This works even for updates because our system versions documents and the old version is not written over, there is simply a new version that the system never "knows" about. Without versioning this would be more of a problem, and you would probably need to protect yourself with code that does something like temporarily keeping the last version of a file during an update and switching over the metadata to reference the new document only at the very last operation in the transaction. We also have the potential of the database not matching the file store when a system is migrated or "cloned." We are very careful about this, but we've at least once had a case where a client's IT depart screwed it up, and got a mismatched system to which they started writing new data. Luckily this was a test or staging system and no production data was lost. I've often wondered if we'd have been better off storing the files in the database. This design decision was made some years ago, and our concerns around this had to do with performance, but I don't know that we had any real data that this should have been a concern, and I suspect you could ameliorate if not eliminate this as an issue by careful design. I'd loved to hear this idea confirmed or debunked by someone who has more expertise (and ideally, done actual testing). Cheers, Eric
On 12/06/2016 01:34 PM, Joshua D. Drake wrote: > On 12/06/2016 10:30 AM, Rich Shepard wrote: > >> My thinking is to not store these documents in the database, but to >> store >> them in subdirectories outside the database. >> >> Your thoughts? > > Due to the widely variable size of a PDF document, I would say no. I > would store the metadata and file location. > > Can you elaborate on this? Why is the variable size an issue? Are you assuming the files go into the same table as the rest of the data? (They certainly don't have to, and I would assume that not to be the smartest design.)
On 12/6/16 11:12 AM, Eric Schwarzenbach wrote: > On 12/06/2016 01:34 PM, Joshua D. Drake wrote: >> On 12/06/2016 10:30 AM, Rich Shepard wrote: >> >>> My thinking is to not store these documents in the database, but >>> to store >>> them in subdirectories outside the database. >>> >>> Your thoughts? >> >> Due to the widely variable size of a PDF document, I would say no. I >> would store the metadata and file location. >> >> > Can you elaborate on this? Why is the variable size an issue? Are you > assuming the files go into the same table as the rest of the data? > (They certainly don't have to, and I would assume that not to be the > smartest design.) The advantages of storing in the database is that a DB backup will have everything, instead of a DB backup and a file system backup. Using a BLOB, you can certainly keep track of variable length PDFs. Also, if in the database, it can be part of a transaction so you will not have any issues keeping the DB and filesystem in sync. David
On 12/6/2016 11:21 AM, David Wall wrote: > > The advantages of storing in the database is that a DB backup will > have everything, instead of a DB backup and a file system backup. > Using a BLOB, you can certainly keep track of variable length PDFs. and one of the disadvantages of storing in the database is those db backups become way huger if there's a lot of this file data. -- john r pierce, recycling bits in santa cruz
On 12/06/2016 11:12 AM, Eric Schwarzenbach wrote: > On 12/06/2016 01:34 PM, Joshua D. Drake wrote: >> On 12/06/2016 10:30 AM, Rich Shepard wrote: >> >>> My thinking is to not store these documents in the database, but to >>> store >>> them in subdirectories outside the database. >>> >>> Your thoughts? >> >> Due to the widely variable size of a PDF document, I would say no. I >> would store the metadata and file location. >> >> > Can you elaborate on this? Why is the variable size an issue? Because it will use at least that size in memory to deliver the document to you. Consider a 100MB PDF (not at all uncommon), now imagine 40 connections requesting that PDF. Sincerely, JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them. Unless otherwise stated, opinions are my own.
On 12/06/2016 11:09 AM, Moreno Andreo wrote: > ... but what if database is replicated? Use a network mounted filesystem (or replicated filesystem). JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them. Unless otherwise stated, opinions are my own.
On 12/06/2016 10:50 AM, Rich Shepard wrote: > On Tue, 6 Dec 2016, Joshua D. Drake wrote: > >> Due to the widely variable size of a PDF document, I would say no. I >> would >> store the metadata and file location. > > Joshua, > > I read your answer as "don't store them in the database, but store the > location in a column." Correct. > > Thanks for confirming, > > Rich > > -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them. Unless otherwise stated, opinions are my own.
On 12/06/2016 11:21 AM, David Wall wrote: > The advantages of storing in the database is that a DB backup will have > everything, instead of a DB backup and a file system backup. Using a > BLOB, you can certainly keep track of variable length PDFs. This is true but also not necessarily an advantage. Your backups will be larger, unless you are using logical backups and omitting certain tables. Your resource utilization will be higher (memory, CPU etc...) to pull and decode the binary. JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them. Unless otherwise stated, opinions are my own.
On Tue, 6 Dec 2016, David Wall wrote: > The advantages of storing in the database is that a DB backup will have > everything, instead of a DB backup and a file system backup. Using a BLOB, > you can certainly keep track of variable length PDFs. David, I did not realize that a BLOB is not the same as a bytea (page 217 of the 9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please point me in the right direction to learn how to store PDFs as BLOBs. Thanks, Rich
On 12/06/2016 02:40 PM, Joshua D. Drake wrote: > On 12/06/2016 11:12 AM, Eric Schwarzenbach wrote: >> On 12/06/2016 01:34 PM, Joshua D. Drake wrote: >>> On 12/06/2016 10:30 AM, Rich Shepard wrote: >>> >>>> My thinking is to not store these documents in the database, but to >>>> store >>>> them in subdirectories outside the database. >>>> >>>> Your thoughts? >>> >>> Due to the widely variable size of a PDF document, I would say no. I >>> would store the metadata and file location. >>> >>> >> Can you elaborate on this? Why is the variable size an issue? > > Because it will use at least that size in memory to deliver the > document to you. Consider a 100MB PDF (not at all uncommon), now > imagine 40 connections requesting that PDF. > Are you sure the whole thing necessarily gets pulled into memory? JDBC and ODBC support streaming on their BLOB interfaces and isn't the whole point of this that an application can stream large files a chunk at a time, the same way it would from the file system? Of course if the db engine always pulls the whole thing into memory to work with it regardless of the API, that's another thing, but that wouldn't seem like a very good design, and I have more faith in the PostgreSQL developers than that...but I'd certainly like to know for sure. Cheers, Eric
On 12/6/2016 12:10 PM, Rich Shepard wrote: > I did not realize that a BLOB is not the same as a bytea (page 217 > of the > 9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please > point me in the right direction to learn how to store PDFs as BLOBs. indeed BYTEA is postgres's type for storing arbitrary binary objects that are called BLOB in certain other databases. -- john r pierce, recycling bits in santa cruz
John R Pierce <pierce@hogranch.com> writes: > On 12/6/2016 12:10 PM, Rich Shepard wrote: >> I did not realize that a BLOB is not the same as a bytea (page 217 >> of the >> 9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please >> point me in the right direction to learn how to store PDFs as BLOBs. > indeed BYTEA is postgres's type for storing arbitrary binary objects > that are called BLOB in certain other databases. Well, there are also "large objects", which aren't really a data type at all. If you're storing stuff large enough that you need to write/read it in chunks rather than all at once, the large-object APIs are what you want. regards, tom lane
On Tue, 6 Dec 2016, John R Pierce wrote: > indeed BYTEA is postgres's type for storing arbitrary binary objects that > are called BLOB in certain other databases. John, I thought so. Thanks, Rich
On 12/6/16 12:33 PM, Tom Lane wrote: > John R Pierce <pierce@hogranch.com> writes: >> On 12/6/2016 12:10 PM, Rich Shepard wrote: >>> I did not realize that a BLOB is not the same as a bytea (page 217 >>> of the >>> 9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please >>> point me in the right direction to learn how to store PDFs as BLOBs. >> indeed BYTEA is postgres's type for storing arbitrary binary objects >> that are called BLOB in certain other databases. > Well, there are also "large objects", which aren't really a data type at > all. If you're storing stuff large enough that you need to write/read > it in chunks rather than all at once, the large-object APIs are what > you want. > > regards, tom lane Yeah, we've not used much BYTEA, but use PG's large objects. It also has a streaming API and you don't have to encode/decode every byte going in and out of the DB. In a table, you juse define the "blob_data" column as an OID. Since we use Java/JDBC, this is handled by ResultSet.getBlob() for a java.sql.Blob object. Some complain about DB backups being biggers if the PDFs are inside, which is true, but this only presumes you don't care about the filesystem PDFs being backed up separately (and no way to ensure a reliable DB backup and PDF filesystem backup if the system is active when doing the backups). You can certainly put the files in a filesystem and point to them, but you'll likely need some access control or people will be able to download any/all PDFs in a given folder. In the DB, you surely will have access control as I presume you don't allow browser access to the DB <smile>. Either way, you may want to see if your PDFs compress well or not as that may save some storage space at the cost of compress/decompress on accesses. David
On 12/6/2016 1:02 PM, David Wall wrote: > You can certainly put the files in a filesystem and point to them, but > you'll likely need some access control or people will be able to > download any/all PDFs in a given folder. In the DB, you surely will > have access control as I presume you don't allow browser access to the > DB <smile>. thats easily remedied by using storage accessed via https or whatever with directory listing disabled, and hashed names. Or, only allow your app server direct access to the file system if its via NFS or whatever. > Either way, you may want to see if your PDFs compress well or not as > that may save some storage space at the cost of compress/decompress on > accesses. pretty sure most all PDF's are already compressed, and won't compress much further. -- john r pierce, recycling bits in santa cruz
> On Dec 6, 2016, at 1:09 PM, Eric Schwarzenbach <subscriber@blackbrook.org> wrote: > > I've often wondered if we'd have been better off storing the files in the database. This design decision was made someyears ago, and our concerns around this had to do with performance, but I don't know that we had any real data that thisshould have been a concern, and I suspect you could ameliorate if not eliminate this as an issue by careful design. I'dloved to hear this idea confirmed or debunked by someone who has more expertise (and ideally, done actual testing). I have been storing PDFs in Postgres for several years without any problems. Documents range in size from a few pages to100+ pages. I'm using a bytea column, not large objects. I store the documents in a separate database from the rest ofthe application data in order to make it easy to exclude in database dumps or backup in some other way. I'm currently managingabout 600,000 documents. I created some functions that enable a subset of the document database to be synchronized elsewhere. For example, we needto keep only the last 3 years of documents on a website for user access. Using Postgres has made this easy to manageand verify. And with replication we automatically have the document database available on the backup web server withoutadditional effort. John DeSoi, Ph.D.
On Thu, 8 Dec 2016, John DeSoi wrote: > I have been storing PDFs in Postgres for several years without any > problems. Documents range in size from a few pages to 100+ pages. I'm > using a bytea column, not large objects. I store the documents in a > separate database from the rest of the application data in order to make > it easy to exclude in database dumps or backup in some other way. I'm > currently managing about 600,000 documents. John, This is really good information. Rather than using a separate database I think that storing all PDFs in a separate table makes sense for my application. Backup practices will be the domain of those using the application (which I've decided to open-source and give away because I'm not in the software business). A simple join to the appropriate data table will make them available. Not having used the bytea data type before I'll read how to work with it. Thanks very much for your insights, Rich
On 12/08/2016 07:16 AM, Rich Shepard wrote: > On Thu, 8 Dec 2016, John DeSoi wrote: > >> I have been storing PDFs in Postgres for several years without any >> problems. Documents range in size from a few pages to 100+ pages. I'm >> using a bytea column, not large objects. I store the documents in a >> separate database from the rest of the application data in order to make >> it easy to exclude in database dumps or backup in some other way. I'm >> currently managing about 600,000 documents. > > John, > > This is really good information. Rather than using a separate database I > think that storing all PDFs in a separate table makes sense for my > application. Backup practices will be the domain of those using the > application (which I've decided to open-source and give away because I'm > not > in the software business). A simple join to the appropriate data table will > make them available. > > Not having used the bytea data type before I'll read how to work with it. http://initd.org/psycopg/docs/usage.html?highlight=binary#adapt-binary > > Thanks very much for your insights, > > Rich > > -- Adrian Klaver adrian.klaver@aklaver.com
On Thu, Dec 8, 2016 at 7:16 AM, Rich Shepard <rshepard@appl-ecosys.com> wrote:
On Thu, 8 Dec 2016, John DeSoi wrote:I have been storing PDFs in Postgres for several years without any
problems. Documents range in size from a few pages to 100+ pages. I'm
using a bytea column, not large objects. I store the documents in a
separate database from the rest of the application data in order to make
it easy to exclude in database dumps or backup in some other way. I'm
currently managing about 600,000 documents.
John,
This is really good information. Rather than using a separate database I
think that storing all PDFs in a separate table makes sense for my
application. Backup practices will be the domain of those using the
application (which I've decided to open-source and give away because I'm not
in the software business). A simple join to the appropriate data table will
make them available.
Not having used the bytea data type before I'll read how to work with it.
Assuming relatively small files, bytea makes much more sense than a large object. However note that encoding and decoding can be relatively memory intensive depending on your environment. This is not a problem with small files and I would typically start to worry when you get into the hundreds of mb in size. At least in Perl, I expect decoding to take about 8x the size of the final file in RAM.
LOBs work best when you need a streaming interface (seek and friends) while bytea's are otherwise much more pleasant to work with.
LOBs work best when you need a streaming interface (seek and friends) while bytea's are otherwise much more pleasant to work with.
Thanks very much for your insights,
Rich
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
--
Best Wishes,
Chris Travers
Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in.
On Thu, 8 Dec 2016, Adrian Klaver wrote: > http://initd.org/psycopg/docs/usage.html?highlight=binary#adapt-binary Thanks again, Adrian. Rich
> On Dec 8, 2016, at 9:25 AM, Chris Travers <chris.travers@gmail.com> wrote: > > Assuming relatively small files, bytea makes much more sense than a large object. However note that encoding and decodingcan be relatively memory intensive depending on your environment. This is not a problem with small files and I wouldtypically start to worry when you get into the hundreds of mb in size. At least in Perl, I expect decoding to takeabout 8x the size of the final file in RAM. > > LOBs work best when you need a streaming interface (seek and friends) while bytea's are otherwise much more pleasant towork with. Not much I can do on the Postgres side, but you can manage the amount of RAM needed on the client side by returning the byteain chunks using a set returning function. In my case, this returns chunks to PHP that are immediately written to thedownload stream so there is no need to have the entire document in RAM on the application side. I have included the functionI use below. John DeSoi, Ph.D. create or replace function blob_content_chunked(p_dbid integer) returns setof bytea as $$ declare v_chunk integer = 1048576; v_start integer = 1; v_data bytea; v_size integer; begin select into v_data content from blob where dbid = p_dbid; if found and v_data is not null then v_size = octet_length(v_data); if v_size <= v_chunk then return next v_data; else for i in 1..v_size by v_chunk loop return next substring(v_data from i for v_chunk); end loop; end if; end if; end; $$ language plpgsql stable;
On Thu, 8 Dec 2016, Chris Travers wrote: > Assuming relatively small files, bytea makes much more sense than a large > object. Hi Chris, Most of the documents are only a few pages in size. > LOBs work best when you need a streaming interface (seek and friends) while > bytea's are otherwise much more pleasant to work with. They're not referenced frequently, only occasionally. Thanks, Rich