Thread: PDF files: to store in database or not

PDF files: to store in database or not

From
Rich Shepard
Date:
   With no experience of storing binary data in a bytea column I don't know
when its use is appropriate. I suspect that for an application I'm
developing it would be better to store row-related documents outside the
database, and want to learn if that is the appropriate approach.

   Consider an application that manages a fleet of vehicles. There's a
Vehicles table with information on each one (perhaps make, model, VIN, year
of purchase) and a Services table. There are many PDF documents associated
with each row in the tables: purchase contract, insurance form, service and
maintenance records, etc.

   My thinking is to not store these documents in the database, but to store
them in subdirectories outside the database.

   Your thoughts?

Rich



Re: PDF files: to store in database or not

From
"Joshua D. Drake"
Date:
On 12/06/2016 10:30 AM, Rich Shepard wrote:

>   My thinking is to not store these documents in the database, but to store
> them in subdirectories outside the database.
>
>   Your thoughts?

Due to the widely variable size of a PDF document, I would say no. I
would store the metadata and file location.

Sincerely,

JD


>
> Rich
>
>
>


--
Command Prompt, Inc.                  http://the.postgres.company/
                         +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.


Re: PDF files: to store in database or not

From
Rich Shepard
Date:
On Tue, 6 Dec 2016, Joshua D. Drake wrote:

> Due to the widely variable size of a PDF document, I would say no. I would
> store the metadata and file location.

Joshua,

   I read your answer as "don't store them in the database, but store the
location in a column."

Thanks for confirming,

Rich


Re: PDF files: to store in database or not

From
Moreno Andreo
Date:
... but what if database is replicated?

Thanks
Moreno.

Il 06/12/2016 19:50, Rich Shepard ha scritto:
> On Tue, 6 Dec 2016, Joshua D. Drake wrote:
>
>> Due to the widely variable size of a PDF document, I would say no. I
>> would
>> store the metadata and file location.
>
> Joshua,
>
>   I read your answer as "don't store them in the database, but store the
> location in a column."
>
> Thanks for confirming,
>
> Rich
>
>




Re: PDF files: to store in database or not

From
Eric Schwarzenbach
Date:
On 12/06/2016 01:30 PM, Rich Shepard wrote:
>   With no experience of storing binary data in a bytea column I don't
> know
> when its use is appropriate. I suspect that for an application I'm
> developing it would be better to store row-related documents outside the
> database, and want to learn if that is the appropriate approach.
>
>   Consider an application that manages a fleet of vehicles. There's a
> Vehicles table with information on each one (perhaps make, model, VIN,
> year
> of purchase) and a Services table. There are many PDF documents
> associated
> with each row in the tables: purchase contract, insurance form,
> service and
> maintenance records, etc.
>
>   My thinking is to not store these documents in the database, but to
> store
> them in subdirectories outside the database.
>
>   Your thoughts?
>
> Rich
I'd also be interested in answers to this that are give specific pros
and cons, and not in terms of "its better to do this than that."
What's "better" depends on how much you value the various pros and the
cons.

One of the pros of keeping them in the database is ease of protecting
adds and updates to the files and their related data with a transaction
and being able to have system where it iss pretty much impossible for
the documents to ever be out of sync with the related data.

I maintain some systems that do keep the documents outside of the
database, and the application code maintains the transactional integrity
of the files and data, and for the most part we don't have integrity
problems. In the worst of an add or update operation being interrupted
by a system crash or unexpected error, we have a new document saved but
the data about this document has not been written to the database and it
is as if that operation never happened. The file may really be there but
the system does not "know about it." This works even for updates because
our system versions documents and the old version is not written over,
there is simply a new version that the system never "knows" about.
Without versioning this would be more of a problem, and you would
probably need to protect yourself with code that does something like
temporarily keeping the last version of a file during an update and
switching over the metadata to reference the new document only at the
very last operation in the transaction.

We also have the potential of the database not matching the file store
when a system is migrated or "cloned." We are very careful about this,
but we've at least once had a case where a client's IT depart screwed it
up, and got a mismatched system to which they started writing new data.
Luckily this was a test or staging system and no production data was lost.

I've often wondered if we'd have been better off storing the files in
the database. This design decision was made some years ago, and our
concerns around this had to do with performance, but I don't know that
we had any real data that this should have been a concern, and I suspect
you could ameliorate if not eliminate this as an issue by careful
design. I'd loved to hear this idea confirmed or debunked by someone who
has more expertise (and ideally, done actual testing).

Cheers,

Eric


Re: PDF files: to store in database or not

From
Eric Schwarzenbach
Date:
On 12/06/2016 01:34 PM, Joshua D. Drake wrote:
> On 12/06/2016 10:30 AM, Rich Shepard wrote:
>
>>   My thinking is to not store these documents in the database, but to
>> store
>> them in subdirectories outside the database.
>>
>>   Your thoughts?
>
> Due to the widely variable size of a PDF document, I would say no. I
> would store the metadata and file location.
>
>
Can you elaborate on this? Why is the variable size an issue? Are you
assuming the files go into the same table as the rest of the data? (They
certainly don't have to, and I would assume that not to be the smartest
design.)


Re: PDF files: to store in database or not

From
David Wall
Date:
On 12/6/16 11:12 AM, Eric Schwarzenbach wrote:
> On 12/06/2016 01:34 PM, Joshua D. Drake wrote:
>> On 12/06/2016 10:30 AM, Rich Shepard wrote:
>>
>>>   My thinking is to not store these documents in the database, but
>>> to store
>>> them in subdirectories outside the database.
>>>
>>>   Your thoughts?
>>
>> Due to the widely variable size of a PDF document, I would say no. I
>> would store the metadata and file location.
>>
>>
> Can you elaborate on this? Why is the variable size an issue? Are you
> assuming the files go into the same table as the rest of the data?
> (They certainly don't have to, and I would assume that not to be the
> smartest design.)


The advantages of storing in the database is that a DB backup will have
everything, instead of a DB backup and a file system backup. Using a
BLOB, you can certainly keep track of variable length PDFs.

Also, if in the database, it can be part of a transaction so you will
not have any issues keeping the DB and filesystem in sync.

David


Re: PDF files: to store in database or not

From
John R Pierce
Date:
On 12/6/2016 11:21 AM, David Wall wrote:
>
> The advantages of storing in the database is that a DB backup will
> have everything, instead of a DB backup and a file system backup.
> Using a BLOB, you can certainly keep track of variable length PDFs.

and one of the disadvantages of storing in the database is those db
backups become way huger if there's a lot of this file data.


--
john r pierce, recycling bits in santa cruz



Re: PDF files: to store in database or not

From
"Joshua D. Drake"
Date:
On 12/06/2016 11:12 AM, Eric Schwarzenbach wrote:
> On 12/06/2016 01:34 PM, Joshua D. Drake wrote:
>> On 12/06/2016 10:30 AM, Rich Shepard wrote:
>>
>>>   My thinking is to not store these documents in the database, but to
>>> store
>>> them in subdirectories outside the database.
>>>
>>>   Your thoughts?
>>
>> Due to the widely variable size of a PDF document, I would say no. I
>> would store the metadata and file location.
>>
>>
> Can you elaborate on this? Why is the variable size an issue?

Because it will use at least that size in memory to deliver the document
to you. Consider a 100MB PDF (not at all uncommon), now imagine 40
connections requesting that PDF.


Sincerely,

JD


--
Command Prompt, Inc.                  http://the.postgres.company/
                         +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.


Re: PDF files: to store in database or not

From
"Joshua D. Drake"
Date:
On 12/06/2016 11:09 AM, Moreno Andreo wrote:
> ... but what if database is replicated?

Use a network mounted filesystem (or replicated filesystem).

JD



--
Command Prompt, Inc.                  http://the.postgres.company/
                         +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.


Re: PDF files: to store in database or not

From
"Joshua D. Drake"
Date:
On 12/06/2016 10:50 AM, Rich Shepard wrote:
> On Tue, 6 Dec 2016, Joshua D. Drake wrote:
>
>> Due to the widely variable size of a PDF document, I would say no. I
>> would
>> store the metadata and file location.
>
> Joshua,
>
>   I read your answer as "don't store them in the database, but store the
> location in a column."

Correct.

>
> Thanks for confirming,
>
> Rich
>
>


--
Command Prompt, Inc.                  http://the.postgres.company/
                         +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.


Re: PDF files: to store in database or not

From
"Joshua D. Drake"
Date:
On 12/06/2016 11:21 AM, David Wall wrote:

> The advantages of storing in the database is that a DB backup will have
> everything, instead of a DB backup and a file system backup. Using a
> BLOB, you can certainly keep track of variable length PDFs.

This is true but also not necessarily an advantage. Your backups will be
larger, unless you are using logical backups and omitting certain tables.

Your resource utilization will be higher (memory, CPU etc...) to pull
and decode the binary.

JD




--
Command Prompt, Inc.                  http://the.postgres.company/
                         +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.


Re: PDF files: to store in database or not

From
Rich Shepard
Date:
On Tue, 6 Dec 2016, David Wall wrote:

> The advantages of storing in the database is that a DB backup will have
> everything, instead of a DB backup and a file system backup. Using a BLOB,
> you can certainly keep track of variable length PDFs.

David,

   I did not realize that a BLOB is not the same as a bytea (page 217 of the
9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please
point me in the right direction to learn how to store PDFs as BLOBs.

Thanks,

Rich


Re: PDF files: to store in database or not

From
Eric Schwarzenbach
Date:
On 12/06/2016 02:40 PM, Joshua D. Drake wrote:
> On 12/06/2016 11:12 AM, Eric Schwarzenbach wrote:
>> On 12/06/2016 01:34 PM, Joshua D. Drake wrote:
>>> On 12/06/2016 10:30 AM, Rich Shepard wrote:
>>>
>>>>   My thinking is to not store these documents in the database, but to
>>>> store
>>>> them in subdirectories outside the database.
>>>>
>>>>   Your thoughts?
>>>
>>> Due to the widely variable size of a PDF document, I would say no. I
>>> would store the metadata and file location.
>>>
>>>
>> Can you elaborate on this? Why is the variable size an issue?
>
> Because it will use at least that size in memory to deliver the
> document to you. Consider a 100MB PDF (not at all uncommon), now
> imagine 40 connections requesting that PDF.
>
Are you sure the whole thing necessarily gets pulled into memory? JDBC
and ODBC support streaming on their BLOB interfaces and isn't the whole
point of this that an application can stream large files a chunk at a
time, the same way it would from the file system? Of course if the db
engine always pulls the whole thing into memory to work with it
regardless of the API, that's another thing, but that wouldn't seem like
a very good design, and I have more faith in the PostgreSQL developers
than that...but I'd certainly like to know for sure.

Cheers,

Eric




Re: PDF files: to store in database or not

From
John R Pierce
Date:
On 12/6/2016 12:10 PM, Rich Shepard wrote:
>   I did not realize that a BLOB is not the same as a bytea (page 217
> of the
> 9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please
> point me in the right direction to learn how to store PDFs as BLOBs.

indeed BYTEA is postgres's type for storing arbitrary binary objects
that are called BLOB in certain other databases.


--
john r pierce, recycling bits in santa cruz



Re: PDF files: to store in database or not

From
Tom Lane
Date:
John R Pierce <pierce@hogranch.com> writes:
> On 12/6/2016 12:10 PM, Rich Shepard wrote:
>> I did not realize that a BLOB is not the same as a bytea (page 217
>> of the
>> 9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please
>> point me in the right direction to learn how to store PDFs as BLOBs.

> indeed BYTEA is postgres's type for storing arbitrary binary objects
> that are called BLOB in certain other databases.

Well, there are also "large objects", which aren't really a data type at
all.  If you're storing stuff large enough that you need to write/read
it in chunks rather than all at once, the large-object APIs are what
you want.

            regards, tom lane


Re: PDF files: to store in database or not

From
Rich Shepard
Date:
On Tue, 6 Dec 2016, John R Pierce wrote:

> indeed BYTEA is postgres's type for storing arbitrary binary objects that
> are called BLOB in certain other databases.

John,

   I thought so.

Thanks,

Rich


Re: PDF files: to store in database or not

From
David Wall
Date:
On 12/6/16 12:33 PM, Tom Lane wrote:
> John R Pierce <pierce@hogranch.com> writes:
>> On 12/6/2016 12:10 PM, Rich Shepard wrote:
>>> I did not realize that a BLOB is not the same as a bytea (page 217
>>> of the
>>> 9.6 PDF manual), and I cannot find BLOB as a postgres data type. Please
>>> point me in the right direction to learn how to store PDFs as BLOBs.
>> indeed BYTEA is postgres's type for storing arbitrary binary objects
>> that are called BLOB in certain other databases.
> Well, there are also "large objects", which aren't really a data type at
> all.  If you're storing stuff large enough that you need to write/read
> it in chunks rather than all at once, the large-object APIs are what
> you want.
>
>             regards, tom lane


Yeah, we've not used much BYTEA, but use PG's large objects.  It also
has a streaming API and you don't have to encode/decode every byte going
in and out of the DB.

In a table, you juse define the "blob_data" column as an OID.  Since we
use Java/JDBC, this is handled by ResultSet.getBlob() for a
java.sql.Blob object.

Some complain about DB backups being biggers if the PDFs are inside,
which is true, but this only presumes you don't care about the
filesystem PDFs being backed up separately (and no way to ensure a
reliable DB backup and PDF filesystem backup if the system is active
when doing the backups).  You can certainly put the files in a
filesystem and point to them, but you'll likely need some access control
or people will be able to download any/all PDFs in a given folder.  In
the DB, you surely will have access control as I presume you don't allow
browser access to the DB <smile>.

Either way, you may want to see if your PDFs compress well or not as
that may save some storage space at the cost of compress/decompress on
accesses.

David


Re: PDF files: to store in database or not

From
John R Pierce
Date:
On 12/6/2016 1:02 PM, David Wall wrote:
> You can certainly put the files in a filesystem and point to them, but
> you'll likely need some access control or people will be able to
> download any/all PDFs in a given folder. In the DB, you surely will
> have access control as I presume you don't allow browser access to the
> DB <smile>.

thats easily remedied by using storage accessed via https or whatever
with directory listing disabled, and hashed names.   Or, only allow your
app server direct access to the file system if its via NFS or whatever.



> Either way, you may want to see if your PDFs compress well or not as
> that may save some storage space at the cost of compress/decompress on
> accesses.


pretty sure most all PDF's are already compressed, and won't compress
much further.

--
john r pierce, recycling bits in santa cruz



Re: PDF files: to store in database or not

From
John DeSoi
Date:
> On Dec 6, 2016, at 1:09 PM, Eric Schwarzenbach <subscriber@blackbrook.org> wrote:
>
> I've often wondered if we'd have been better off storing the files in the database. This design decision was made
someyears ago, and our concerns around this had to do with performance, but I don't know that we had any real data that
thisshould have been a concern, and I suspect you could ameliorate if not eliminate this as an issue by careful design.
I'dloved to hear this idea confirmed or debunked by someone who has more expertise (and ideally, done actual testing). 

I have been storing PDFs in Postgres for several years without any problems. Documents range in size from a few pages
to100+ pages. I'm using a bytea column, not large objects. I store the documents in a separate database from the rest
ofthe application data in order to make it easy to exclude in database dumps or backup in some other way. I'm currently
managingabout 600,000 documents. 

I created some functions that enable a subset of the document database to be synchronized elsewhere. For example, we
needto keep only the last 3 years of documents on a website for user access. Using Postgres has made this easy to
manageand verify. And with replication we automatically have the document database available on the backup web server
withoutadditional effort. 

John DeSoi, Ph.D.



Re: PDF files: to store in database or not

From
Rich Shepard
Date:
On Thu, 8 Dec 2016, John DeSoi wrote:

> I have been storing PDFs in Postgres for several years without any
> problems. Documents range in size from a few pages to 100+ pages. I'm
> using a bytea column, not large objects. I store the documents in a
> separate database from the rest of the application data in order to make
> it easy to exclude in database dumps or backup in some other way. I'm
> currently managing about 600,000 documents.

John,

   This is really good information. Rather than using a separate database I
think that storing all PDFs in a separate table makes sense for my
application. Backup practices will be the domain of those using the
application (which I've decided to open-source and give away because I'm not
in the software business). A simple join to the appropriate data table will
make them available.

   Not having used the bytea data type before I'll read how to work with it.

Thanks very much for your insights,

Rich


Re: PDF files: to store in database or not

From
Adrian Klaver
Date:
On 12/08/2016 07:16 AM, Rich Shepard wrote:
> On Thu, 8 Dec 2016, John DeSoi wrote:
>
>> I have been storing PDFs in Postgres for several years without any
>> problems. Documents range in size from a few pages to 100+ pages. I'm
>> using a bytea column, not large objects. I store the documents in a
>> separate database from the rest of the application data in order to make
>> it easy to exclude in database dumps or backup in some other way. I'm
>> currently managing about 600,000 documents.
>
> John,
>
>   This is really good information. Rather than using a separate database I
> think that storing all PDFs in a separate table makes sense for my
> application. Backup practices will be the domain of those using the
> application (which I've decided to open-source and give away because I'm
> not
> in the software business). A simple join to the appropriate data table will
> make them available.
>
>   Not having used the bytea data type before I'll read how to work with it.

http://initd.org/psycopg/docs/usage.html?highlight=binary#adapt-binary

>
> Thanks very much for your insights,
>
> Rich
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: PDF files: to store in database or not

From
Chris Travers
Date:


On Thu, Dec 8, 2016 at 7:16 AM, Rich Shepard <rshepard@appl-ecosys.com> wrote:
On Thu, 8 Dec 2016, John DeSoi wrote:

I have been storing PDFs in Postgres for several years without any
problems. Documents range in size from a few pages to 100+ pages. I'm
using a bytea column, not large objects. I store the documents in a
separate database from the rest of the application data in order to make
it easy to exclude in database dumps or backup in some other way. I'm
currently managing about 600,000 documents.

John,

  This is really good information. Rather than using a separate database I
think that storing all PDFs in a separate table makes sense for my
application. Backup practices will be the domain of those using the
application (which I've decided to open-source and give away because I'm not
in the software business). A simple join to the appropriate data table will
make them available.

  Not having used the bytea data type before I'll read how to work with it.

Assuming relatively small files, bytea makes much more sense than a large object.  However note that encoding and decoding can be relatively memory intensive depending on your environment.  This is not a problem with small files and I would typically start to worry when you get into the hundreds of mb in size.  At least in Perl, I expect decoding to take about 8x the size of the final file in RAM.

LOBs work best when you need a streaming interface (seek and friends) while bytea's are otherwise much more pleasant to work with. 

Thanks very much for your insights,

Rich


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: PDF files: to store in database or not

From
Rich Shepard
Date:
On Thu, 8 Dec 2016, Adrian Klaver wrote:

> http://initd.org/psycopg/docs/usage.html?highlight=binary#adapt-binary

   Thanks again, Adrian.

Rich


Re: PDF files: to store in database or not

From
John DeSoi
Date:
> On Dec 8, 2016, at 9:25 AM, Chris Travers <chris.travers@gmail.com> wrote:
>
> Assuming relatively small files, bytea makes much more sense than a large object.  However note that encoding and
decodingcan be relatively memory intensive depending on your environment.  This is not a problem with small files and I
wouldtypically start to worry when you get into the hundreds of mb in size.  At least in Perl, I expect decoding to
takeabout 8x the size of the final file in RAM. 
>
> LOBs work best when you need a streaming interface (seek and friends) while bytea's are otherwise much more pleasant
towork with.  

Not much I can do on the Postgres side, but you can manage the amount of RAM needed on the client side by returning the
byteain chunks using a set returning function. In my case, this returns chunks to PHP that are immediately written to
thedownload stream so there is no need to have the entire document in RAM on the application side. I have included the
functionI use below. 

John DeSoi, Ph.D.


create or replace function blob_content_chunked(p_dbid integer)
returns setof bytea as $$
declare
  v_chunk integer = 1048576;
  v_start integer = 1;
  v_data bytea;
  v_size integer;
begin
  select into v_data content from blob where dbid = p_dbid;
  if found and v_data is not null then
    v_size = octet_length(v_data);
    if v_size <= v_chunk then
      return next v_data;
    else
      for i in 1..v_size by v_chunk loop
        return next substring(v_data from i for v_chunk);
      end loop;
    end if;
  end if;
end;
$$ language plpgsql stable;

Re: PDF files: to store in database or not

From
Rich Shepard
Date:
On Thu, 8 Dec 2016, Chris Travers wrote:

> Assuming relatively small files, bytea makes much more sense than a large
> object.

Hi Chris,

   Most of the documents are only a few pages in size.

> LOBs work best when you need a streaming interface (seek and friends) while
> bytea's are otherwise much more pleasant to work with.

   They're not referenced frequently, only occasionally.

Thanks,

Rich