Thread: Need suggestion

Need suggestion

From: Carl von Clausewitz
Hello Everyone,

I got a new project with 100 users in Europe. I need to handle production and sales processes and their documentation in PostgreSQL with PHP. The load of the sales process is negligible, but every user produces 2 transactions in the production process, with 10-30 scanned documents (each 400 kB - 800 kB) and 30-50 high-resolution pictures (each 3-8 MB), and they want to upload them 'somewhere'. 'Somewhere' could be the server filesystem, with a link in the PostgreSQL database to the location of the files (plus some metadata), or it could be the PostgreSQL database itself.

My question is: what is your opinion about storing the scanned documentation and the pictures in the database? This is a huge amount of data (between 188 MB and 800 MB daily; an average year is about 1 TB), but it must be searchable, and any document must be retrievable within 1 hour. Every document must be stored for up to 5 years... That means the database could be about 6-7 TB after 5 years, at which point we can start archiving documents. Any other data size is negligible.

If you suggest storing all of the data in PostgreSQL, what are your recommendations about table and index structure, clustering, and archiving?

Thank you in advance!
Regards,
Carl 

Re: Need suggestion

From: Ognjen Blagojevic
Carl,

I don't have experience with databases that big, but I have implemented
both solutions, and here are the pros of each:

1. Files stored on the filesystem:
- Small database footprint
- Faster backup, export and import

2. Files stored in the database:
- RDBMS takes care of transactions and referential integrity
- Slower backup, export and import, but all done in one step
- Easier continuous archiving

I slightly prefer option no. 2, since transaction handling, rollback and
referential integrity are not so easy to implement when you have two
different storage systems (FS and RDB).

As for indexes and tables, it is not clear from your message whether you
need just a regular search (field LIKE 'something'), full text search of
the metadata, or full text search of the scanned documents (in case they
are OCRed).
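To illustrate the difference (table and column names below are
hypothetical):

    -- plain pattern match against a metadata field
    SELECT id FROM file WHERE name LIKE 'invoice%';

    -- full text search over OCRed content, with the tsearch built into 8.3+
    SELECT id FROM file
    WHERE to_tsvector('english', ocr_text) @@ to_tsquery('english', 'invoice');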

Regards,
Ognjen




Re: Need suggestion

From: Carl von Clausewitz
Dear Ognjen,

thank you - that was my idea too, but I've never seen a workload quite like this. The documents (which hopefully are not processed by any OCR) and the pictures are of course not indexed, just some metadata related to each document or picture. For example:

productions_docu1:
- sent date
- received date
- type
- owner
- case_id
- etc.

image_001:
- picture taken date
- case_id
- image_type

Just this metadata needs to be searched. My question about the structure was: do you recommend storing the images and the documents in the same table, e.g.

    CREATE TABLE docu_img_store (
        id BIGSERIAL,
        case_id BIGINT,
        content_type INTEGER,
        content BYTEA
    );

or in two different tables? Are there any special settings at table creation that I should set for optimal operation (like indexes, storage parameters, TOAST, etc.)?

(:-) I know this project could be a high-value engagement for any DB consultancy, but this is a small country with small project fees, and I'm an employee, not a contractor, at my company :-)

Thank you in advance,
Regards,
Carl



Re: Need suggestion

From: Ognjen Blagojevic
Carl,

Please keep in mind that I am not a Postgres expert or consultant; I'm
just sharing my experience. I would also like to hear the opinions of
other people who have worked on projects with similar database sizes.

I would keep all files in a single table -- most probably they will be
served to the user by the same code (e.g. a file download servlet or
something similar), so it is good if all the relevant info is in one
table, something like:

file {
  id
  mime_type
  name
  content bytea
  creation_date datetime
  modification_date datetime
  creation_user
  modification_user
}

Since both image and document inherit from file, you may choose any of
the common RDB inheritance modeling strategies (one table per hierarchy,
one table per class...), but since there are just a few fields, I would
put everything in the same table.
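
As a rough DDL sketch of that single table (types and constraints are my
guesses only, adjust to your needs):

    CREATE TABLE file (
        id                bigserial PRIMARY KEY,
        case_id           bigint NOT NULL,  -- assumes one case per file; see below
        mime_type         text NOT NULL,
        name              text NOT NULL,
        content           bytea NOT NULL,
        creation_date     timestamp NOT NULL DEFAULT now(),
        modification_date timestamp,
        creation_user     text,
        modification_user text
    );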

Consider the cardinality between cases and files/users. Can one file be
related to two cases, and so on?

The TOAST table will be split into 1 GB segments on disk.

Create indexes considering ways your users will browse or search data.
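
For example, if users will mostly look things up by case and by date
(index names are again just illustrative):

    CREATE INDEX file_case_id_idx ON file (case_id);
    CREATE INDEX file_creation_date_idx ON file (creation_date);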

Regards,
Ognjen




Re: Need suggestion

From: Ben Chobot

So, you're mostly storing ~1 TB of images per year? That doesn't seem so bad. How will the documents be searched? Will their contents be OCR'd out and put into a full text search? How many searches will be going on?

If you're asking whether or not it makes sense to store 7 TB of images in the database, as opposed to storing links to those images and keeping the images themselves on a normal filesystem, there's no clear answer. Check the archives for the pros and cons of each method.

Re: Need suggestion

From: John R Pierce

with many terabytes of large file data accumulating, the database will
become very unwieldy to do any maintenance on.  a simple pg_dump will
take many hours vs a few minutes.

I would almost certainly use a filesystem for an app like this, and just
store the metadata in the database.
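
A minimal sketch of that variant (table and column names are mine, not a
recommendation) -- the bytes live on disk and the row keeps a path plus
the searchable metadata:

    CREATE TABLE document_meta (
        id            bigserial PRIMARY KEY,
        case_id       bigint NOT NULL,
        file_path     text NOT NULL,  -- e.g. relative to some base directory
        mime_type     text,
        sent_date     date,
        received_date date
    );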

--
john r pierce                            N 37, W 123
santa cruz ca                         mid-left coast


Re: Need suggestion

From: tomas@tuxteam.de

Something to consider too -- if you decide to store the big objects in
the database, that is -- is PostgreSQL's large object interface
<http://www.postgresql.org/docs/9.1/static/largeobjects.html>. The
problems with backup someone else mentioned in this thread would remain,
but you wouldn't have large blobs of data clobbering your "regular"
queries. You could pass the scans and pics piecemeal between client and
database without having to store them in the middleware (which may be an
advantage or a disadvantage, mind you).
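
For illustration only, a sketch of the server-side large-object calls
(table and paths are made up; lo_import/lo_export here run in the
backend, so the paths must be accessible to the server):

    CREATE TABLE scan_lo (id bigserial PRIMARY KEY, case_id bigint, scan oid);
    -- store a scan as a large object and remember its OID
    INSERT INTO scan_lo (case_id, scan)
    VALUES (42, lo_import('/srv/scans/case42/scan_0001.tif'));
    -- write it back out to a server-side file
    SELECT lo_export(scan, '/tmp/scan_0001.tif') FROM scan_lo WHERE id = 1;
    -- free the object when the row is removed
    SELECT lo_unlink(scan) FROM scan_lo WHERE id = 1;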

Don't know whether PHP has bindings for that, though.

Regards
-- tomás

Re: Need suggestion

From: tomas@tuxteam.de

Following up on myself:

Just stumbled upon this in the hackers mailing list, which might be
interesting to you, since it highlights pros & cons of current
implementations:

  <http://archives.postgresql.org/pgsql-hackers/2011-06/threads.php#00049>

Regards
-- tomás

Re: Need suggestion

From: Karsten Hilbert
On Fri, Jun 03, 2011 at 07:15:40AM +0200, tomas@tuxteam.de wrote:

> but you wouldn't have large blobs of data clobbering your "regular" queries.

You would want to write better queries than

    select * from my_table_with_bytea_column;

anyway.

> You could pass the scans and pics piecemeal between client and database

At least for retrieval even BYTEA can be accessed piecemeal:

    select substring(bytea_column from <start> for <number_of_bytes>)
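
For instance, an 8 MB picture could be fetched in 1 MB slices (offsets
are 1-based; the table is from Carl's earlier example):

    select substring(content from 1 for 1048576)
      from docu_img_store where id = 1;
    select substring(content from 1048577 for 1048576)
      from docu_img_store where id = 1;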

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: Need suggestion

From: Carl von Clausewitz
Thanks for the replies and suggestions from Ognjen, Ben Chobot, John R Pierce, Tomás, and Karsten... I checked the links, and I decided that I cannot decide :-) because I don't know how large the infrastructure for this could be. If I store the images and scanned documents in the database, a radically larger environment will be needed than if I store only the links. My boss will decide, and I will implement whatever he wants; I just wanted to collect some of the experience you have, which you have kindly provided - many thanks for it :-)

Thanks again,
Regards,
Carl. 


Re: Need suggestion

From: Esmin Gracic
Another option is using SQLite for storing the images. All the data is in a single file (or several files, if you organize it that way), which makes backup etc. easier... you get some DB benefits while retaining solid speed versus the filesystem. I haven't used this, but it seems a viable option to explore.

Esmin



Re: Need suggestion

From: John R Pierce

a single multi-terabyte file?    what a *wonderful* idea.  *NOT*



Re: Need suggestion

From: Esmin Gracic
*(or files if you organize it that way)*
if the problem were that simple, I guess Carl would not have asked the question in the first place.

There could be one SQLite db file for each day, week or month (1 TB over 365 days) - something like partitioning on the date dimension.
Actually, SQLite scales well up to 2 TB, so it's not such an *unwonderful* idea even with one file per year (especially on an XFS filesystem).

I would use Postgres + SQLite as the image storing engine.

"...but is must be searchable, and any document must be retrieved within 1 hour..." - this is not bleeding edge requirement, so I would use filesystem if this was 10 sec or something, but having 1 hour timeframe would definitely made me choose db over filesystem.

Esmin.


Re: Need suggestion

From: Merlin Moncure

My postgres database is stored in a single file as well... on my
workstation it's /dev/sda2.  Using a loopback device I could create a
classic file.  The point being: having a single file doesn't eliminate
or simplify fragmentation and sync issues -- it just moves them from one
place to another.

sqlite has fundamentally different operational characteristics due to
its architecture.  It is unsuited for problems where a multi-user
database is typically the tool of choice, for a number of reasons.  For
example, sqlite's locking model is exceptionally crude by comparison,
and intentionally so.  Being able to run inside an application's process
is a huge asset, though.

merlin