Thread: Need suggestion
Hello Everyone,
I got a new project with 100 users in Europe. I need to handle production and sales processes and their documentation in PostgreSQL with PHP. The load of the sales process is negligible, but every user produces 2 transactions in the production process, with 10-30 scanned documents (each 400 KB - 800 KB) and 30-50 high-resolution pictures (each 3-8 MB), and they want to upload them to 'somewhere'. 'Somewhere' could be the server's filesystem, with a link in the PostgreSQL database to the location of the files (plus some metadata), or it could be the PostgreSQL database itself.
My question is: what is your opinion about storing the scanned documentation and the pictures in the database? This is a huge amount of data (between 188 MB and 800 MB daily, about 1 TB in an average year), but it must be searchable, and any document must be retrievable within 1 hour. Every document must be stored for up to 5 years... That means the database could be about 6-7 TB after 5 years, at which point we can start to archive documents. Any other data size is negligible.
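For reference, the daily range above roughly corresponds to two production transactions per day:

  low:  2 x (10 docs x 400 KB + 30 pics x 3 MB) = 2 x  94 MB = 188 MB
  high: 2 x (30 docs x 800 KB + 50 pics x 8 MB) = 2 x 424 MB = 848 MB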
If you suggest storing all of the data in PostgreSQL, what is your recommendation on table and index structure, clustering, and archiving?
Thank you in advance!
Regards,
Carl
Carl,

I don't have experience with databases that big, but I have implemented both solutions, and here are the pros of each:

1. Files stored on the filesystem:
- Small database footprint
- Faster backup, export and import

2. Files stored in the database:
- The RDBMS takes care of transactions and referential integrity
- Slower backup, export and import, but all done in one step
- Easier continuous archiving

I slightly prefer option no. 2, since transaction handling, rollback and referential integrity are not so easy to implement when you have two different storage systems (filesystem and RDBMS).

As for indexes and tables, it is not clear from your message whether you need just a regular search (field LIKE 'something'), full-text search of the metadata, or full-text search of the scanned documents themselves (in case they are OCRed).

Regards,
Ognjen
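To illustrate the difference between those two kinds of search, here is a minimal sketch with made-up table and column names (the real schema is of course up to Carl):

  -- Hypothetical metadata table, for illustration only.
  CREATE TABLE document_metadata (
      id          bigserial PRIMARY KEY,
      title       text,
      description text
  );

  -- Regular pattern search over a metadata column:
  SELECT id FROM document_metadata WHERE title LIKE '%invoice%';

  -- Full-text search over the metadata, backed by a GIN index:
  CREATE INDEX document_metadata_fts_idx ON document_metadata
      USING gin (to_tsvector('simple', coalesce(title, '') || ' ' || coalesce(description, '')));

  SELECT id FROM document_metadata
  WHERE to_tsvector('simple', coalesce(title, '') || ' ' || coalesce(description, ''))
        @@ to_tsquery('simple', 'invoice');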
Dear Ognjen,
Thank you - that was my idea too, but I've never seen a workload like this before. The documents (which, hopefully, are not processed by any OCR) and the pictures are of course not indexed, just some metadata related to the particular document or picture. For example:
productions_docu1:
-sent date
-received date
-type
-owner
-case_id
-etc
image_001:
-picture taken date
-case_id
-image_type
Only this metadata needs to be searched. My question about the structure is this: do you recommend storing the images and the documents in the same table (CREATE TABLE docu_img_store (id BIGSERIAL, case_id BIGINT, content_type INTEGER, content BYTEA)), or in two different tables? Are there any special settings at table creation that I should set for optimal operation (index, storage parameters, TOAST, etc.)?
:-) I know this project could be high-value revenue for any DB consultancy, but this is a small country with small project fees, and I'm an employee, not a contractor, at my company :-)
Thank you in advance,
Regards,
Carl
Carl,

Please keep in mind I am not a Postgres expert or consultant; I'm just sharing my experience. I would also like to hear the opinion of other people who have worked on projects with similar database sizes.

I would keep all the files in a single table -- most probably they will be served to the user by the same code (e.g. a file download servlet or something similar), so it is good if all the relevant info is in one table, something like:

file {
  id
  mime_type
  name
  content bytea
  creation_date datetime
  modification_date datetime
  creation_user
  modification_user
}

Since both image and document inherit from file, you may choose any of the common RDB inheritance modeling strategies (one table per hierarchy, one table per class...), but since there are just a few fields, I would put everything in the same table.

Consider the cardinality between cases and files/users: can one file be related to two cases, and so on?

The TOAST table will be split into 1 GB segment files on disk.

Create indexes considering the ways your users will browse or search the data.

Regards,
Ognjen
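Expressed as actual DDL, the sketch above might look something like this (the column types and index choices are illustrative assumptions, not a fixed recommendation):

  CREATE TABLE file (
      id                bigserial PRIMARY KEY,
      case_id           bigint      NOT NULL,
      mime_type         text        NOT NULL,
      name              text        NOT NULL,
      content           bytea       NOT NULL,
      creation_date     timestamptz NOT NULL DEFAULT now(),
      modification_date timestamptz,
      creation_user     text,
      modification_user text
  );

  -- Index whatever the users will actually browse or search by,
  -- e.g. the owning case and the creation date:
  CREATE INDEX file_case_id_idx       ON file (case_id);
  CREATE INDEX file_creation_date_idx ON file (creation_date);

Large content values are moved out of line into the table's TOAST storage automatically, so metadata-only queries that do not touch the content column stay cheap.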
On Jun 1, 2011, at 1:08 AM, Carl von Clausewitz wrote:
> [...] My question is: what is your opinion about storing the scanned documentation and the pictures in the database? [...]

So, you're mostly storing ~1 TB of images/year? That doesn't seem so bad. How will the documents be searched? Will their contents be OCR'd and put into a full-text search? How many searches will be going on?

If you're asking whether or not it makes sense to store 7 TB of images in the database, as opposed to storing links to those images and keeping the images themselves on a normal filesystem, there's no clear answer. Check the archives for the pros and cons of each method.
On 06/02/11 2:19 AM, Ognjen Blagojevic wrote:
> 1. Files stored on the filesystem:
> [...]
> 2. Files stored in the database
> [...]

With many terabytes of large file data accumulating, the database will become very unwieldy to do any maintenance on. A simple pg_dump will take many hours instead of a few minutes.

I would almost certainly use a filesystem for an app like this, and just store the metadata in the database.

--
john r pierce                            N 37, W 123
santa cruz ca                            mid-left coast
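A rough sketch of that variant, assuming the application writes the file to disk itself and the database keeps only the metadata plus a path (names are illustrative):

  CREATE TABLE file_ref (
      id         bigserial PRIMARY KEY,
      case_id    bigint      NOT NULL,
      mime_type  text        NOT NULL,
      file_path  text        NOT NULL UNIQUE,  -- relative to a document root on the server
      size_bytes bigint,
      created_at timestamptz NOT NULL DEFAULT now()
  );

  -- The application stores the file on disk first, then records it:
  INSERT INTO file_ref (case_id, mime_type, file_path, size_bytes)
  VALUES (42, 'image/jpeg', '2011/06/case_42/img_001.jpg', 5242880);

The trade-off Ognjen mentioned earlier still applies: the database cannot roll back the filesystem write, so the application has to clean up orphaned files itself.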
On Wed, Jun 01, 2011 at 10:08:14AM +0200, Carl von Clausewitz wrote:
> [...]

Something to consider too -- if you decide to store the big objects in the database, that is -- is PostgreSQL's large object interface <http://www.postgresql.org/docs/9.1/static/largeobjects.html>.

The problems with backup someone else mentioned in this thread would remain, but you wouldn't have large blobs of data clobbering your "regular" queries. You could pass the scans and pics piecemeal between client and database without having to store them in the middleware (which may be an advantage or a disadvantage, mind you).

Don't know whether PHP has bindings for that, though.

Regards
-- tomás
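A minimal illustration of that route, using the server-side lo_* functions (the table name and paths are examples; the paths refer to the server's filesystem, and server-side lo_import/lo_export may require superuser privileges):

  CREATE TABLE scan (
      id      bigserial PRIMARY KEY,
      case_id bigint NOT NULL,
      blob    oid    NOT NULL
  );

  -- Import a scanned document into pg_largeobject and keep its OID:
  INSERT INTO scan (case_id, blob)
  VALUES (42, lo_import('/tmp/scan_0001.pdf'));

  -- Write it back out to a file on the server:
  SELECT lo_export(blob, '/tmp/scan_0001_copy.pdf') FROM scan WHERE id = 1;

  -- Large objects are not removed with the row; unlink explicitly:
  SELECT lo_unlink(blob) FROM scan WHERE id = 1;

As far as I know, PHP's pgsql extension exposes this interface through its pg_lo_* functions.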
Following up on myself: I just stumbled upon this in the hackers mailing list, which might be interesting to you, since it highlights the pros & cons of the current implementations:

<http://archives.postgresql.org/pgsql-hackers/2011-06/threads.php#00049>

Regards
-- tomás
On Fri, Jun 03, 2011 at 07:15:40AM +0200, tomas@tuxteam.de wrote:
> but you wouldn't have large blobs of data clobbering your "regular" queries.

You would want to write better queries than

  select * from my_table_with_bytea_column;

anyway.

> You could pass the scans and pics piecemeal between client and database

At least for retrieval, even BYTEA can be accessed piecemeal:

  select substring(bytea_column from <start> for <number_of_bytes>)

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
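Concretely, that lets the application stream a stored picture in fixed-size pieces rather than pulling the whole value at once (using the docu_img_store table from Carl's earlier message as an example):

  -- Total size, so the caller knows how many chunks to expect:
  SELECT octet_length(content) FROM docu_img_store WHERE id = 1;

  -- Fetch the second 1 MB chunk (bytea offsets are 1-based):
  SELECT substring(content FROM 1048577 FOR 1048576)
  FROM docu_img_store WHERE id = 1;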
Thanks for the replies and suggestions from Ognjen, Ben Chobot, John R Pierce, Tomás, and Karsten. I checked the links, and I have decided that I cannot decide :-) because I don't know how large the infrastructure for this could be. If I store the images and scanned documents in the database, a radically larger environment will be needed than if I store only the links. My boss will decide, and I will implement whatever he wants; I just wanted to collect some of the experience you have, and which you have provided for me - many thanks for it :-)
Thanks again,
Regards,
Carl.
Another option is using SQLite for storing the images. All data is in a single file (or several files, if you organize it that way), which makes backup etc. easier; you get some DB benefits while retaining solid speed compared to the filesystem. I haven't used this, but it seems a viable option to explore.
Esmin
On 06/03/11 3:09 PM, Esmin Gracic wrote:
> another option is using sqlite for storing images. All data is in single file. [...]

a single multi-terabyte file? what a *wonderful* idea. *NOT*
*(or files if you organize it that way)*

If the problem were that simple, I guess Carl would not have asked the question in the first place.

There could be one SQLite db file for each day, week or month (1 TB over 365 days) -- something like partitioning on the date dimension. Actually, SQLite scales well up to 2 TB, so it's not such an *unwonderful* idea even with one file per year (especially on an XFS filesystem).

I would use Postgres + SQLite as the image-storing engine.

"...but it must be searchable, and any document must be retrieved within 1 hour..." - this is not a bleeding-edge requirement. I would use the filesystem if the requirement were 10 seconds or so, but a 1-hour timeframe would definitely make me choose a DB over the filesystem.

Esmin.
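For what it's worth, the same partition-by-date idea can be expressed inside PostgreSQL itself with table inheritance. A rough sketch assuming monthly partitions (names and types are illustrative; rows have to be routed to the correct child, e.g. with a trigger or by inserting into the child directly):

  -- Parent table; the children carry the actual rows.
  CREATE TABLE file_part (
      id         bigserial,
      case_id    bigint      NOT NULL,
      content    bytea       NOT NULL,
      created_at timestamptz NOT NULL
  );

  -- One child per month, constrained so the planner can skip it
  -- (constraint_exclusion = partition, the default, must be in effect):
  CREATE TABLE file_part_2011_06 (
      CHECK (created_at >= TIMESTAMPTZ '2011-06-01' AND created_at < TIMESTAMPTZ '2011-07-01')
  ) INHERITS (file_part);

  -- Archiving after 5 years then becomes dumping and dropping old children,
  -- e.g.: DROP TABLE file_part_2011_06;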
On Fri, Jun 3, 2011 at 5:09 PM, Esmin Gracic <esmin.gracic@gmail.com> wrote:
> another option is using sqlite for storing images. All data is in single
> file. (or files if you organize it that way) easier backup etc... [...]

My postgres database is stored in a single file as well... on my workstation it's /dev/sda2. Using a loopback device I could create a classic file. The point being, having a single file doesn't eliminate or simplify fragmentation and sync issues -- it just moves them from one place to another.

sqlite has fundamentally different operational characteristics due to its architecture. It is unsuited for problems where a multi-user database is typically the tool of choice, for a number of reasons. For example, sqlite's locking model is exceptionally crude by comparison, and intentionally so. Being able to run inside an application's process is a huge asset, though.

merlin