Thread: One huge db vs many small dbs
Hello,
We are starting a new project to deploy a solution in the cloud with the possibility of being used by 2,000+ clients. Each of these clients will use several tables to store their information (our model has about 500+ tables, but there are fewer than 100 core tables with heavy use). Also, the projected amount of information per client could range from small (a few hundred tuples/MB) to huge (a few million tuples/GB).
One of the many questions we have is about the performance of the db if we work with only one (using a ClientID to separate each client's info) or thousands of separate dbs. The management of the dbs is not a huge concern as we have an automated tool.
On Google there are lots of discussions about this subject, but none matches the scenario I presented above, so I would like to know if anyone here has a similar situation or relevant knowledge and could share some thoughts.
Thanks
Max
On Thu, Dec 5, 2013 at 2:42 AM, Max <maxabbr@yahoo.com.br> wrote:
> We are starting a new project to deploy a solution in the cloud with the possibility of being used by 2,000+ clients. Each of these clients will use several tables to store their information (our model has about 500+ tables, but there are fewer than 100 core tables with heavy use). Also, the projected amount of information per client could range from small (a few hundred tuples/MB) to huge (a few million tuples/GB).
>
> One of the many questions we have is about the performance of the db if we work with only one (using a ClientID to separate each client's info) or thousands of separate dbs. The management of the dbs is not a huge concern as we have an automated tool.
More details would be helpful, some of which could include:
how much memory is dedicated to Postgresql,
how many servers,
are you using replication/hot standby,
what are your data access patterns like (mostly inserts / lots of concurrent queries, a handful of users versus hundreds querying at the same time),
what are your plans for backups,
what are you planning to do to archive older data?
Also, have you considered separate schemas rather than separate databases?
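As a rough illustration of the schema-per-client idea, each client gets its own namespace inside a single database, and the connection's search_path selects it (all names here are hypothetical):

```sql
-- One schema per client inside a single database (hypothetical names).
CREATE SCHEMA client_0001;
CREATE TABLE client_0001.orders (
    order_id   bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Per connection, point the session at that client's schema:
SET search_path TO client_0001;
SELECT count(*) FROM orders;  -- resolves to client_0001.orders
```

This keeps one database (one connection pool, one backup target) while still giving each client physically separate tables.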
On 12/05/2013 02:42 AM, Max wrote:
> One of the many questions we have is about performance of the db if we
> work with only one (using a ClientID to separate each client's info) or
> thousands of separate dbs. The management of the dbs is not a huge
> concern as we have an automated tool.

If I understand correctly: 500 tables x 2,000 clients = 1 million tables.

Even if not heavily used, in my experience 1 million tables in a single database will cause problems for you:

1) on Postgres versions < 9.3, pg_dump takes a *long* time (think days)
2) psql tab completion is really slow
3) probably others I'm not thinking of right now...

There are advantages to not needing to manage so many databases, but I would test it carefully before committing.
Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
maxabbr wrote:
> One of the many questions we have is about performance of the db if we
> work with only one (using a ClientID to separate each client's info) or
> thousands of separate dbs. The management of the dbs is not a huge
> concern as we have an automated tool.

My untested thought here is a hybrid approach. Allow any one database to contain any number of clients on a common schema with a controlling ClientID column, but allow for multiple databases. Furthermore, any non-client shared data could belong to a separate reference database, with the possibility of caching said data in each of the client databases where applicable. Though until your needs dictate that level of complexity, you can have just one data and schema set for all clients.

While row-level security would make this more tenable, generally this model works best if all client access is made via middleware. You mitigate the risk by using separate databases for any clients with a higher risk profile (i.e., larger datasets, direct access to the DB, etc.).

Adding in ClientID overhead will degrade performance somewhat but increase your flexibility considerably. That is often a worthwhile trade-off to make even if you decide to create separate schemas/databases.

David J.
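A minimal sketch of the shared-table variant with a controlling ClientID column, using row-level security (which shipped later, in PostgreSQL 9.5; table and setting names are hypothetical):

```sql
-- Shared tables keyed by client_id (hypothetical names).
CREATE TABLE orders (
    client_id  int    NOT NULL,
    order_id   bigint NOT NULL,
    PRIMARY KEY (client_id, order_id)
);

-- Row-level security (PostgreSQL 9.5+): each session sees only its client.
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY per_client ON orders
    USING (client_id = current_setting('app.client_id')::int);

-- The middleware sets the client for the connection, e.g.:
-- SET app.client_id = '42';
```

Without RLS, the same isolation has to be enforced by the middleware putting `WHERE client_id = ?` on every query, which is why direct DB access is the risky case David mentions.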
> -----Original Message-----
> From: Max
> Sent: Thursday, December 05, 2013 5:42 AM
> Subject: [PERFORM] One huge db vs many small dbs
>
> One of the many questions we have is about performance of the db if we
> work with only one (using a ClientID to separate each client's info) or
> thousands of separate dbs. The management of the dbs is not a huge
> concern as we have an automated tool.

If you are planning on using persisted connections, the many-databases approach is going to have a significant disadvantage. You cannot pool connections between databases. So if you have 2,000 databases, you are going to need a minimum of 2,000 connections to service those databases (assuming you want to keep at least one active connection open per client at a time).

Brad.
>> One of the many questions we have is about performance of the db if we
>> work with only one (using a ClientID to separate each client's info) or
>> thousands of separate dbs. The management of the dbs is not a huge
>> concern as we have an automated tool.
>
> If you are planning on using persisted connections, the many-databases
> approach is going to have a significant disadvantage. You cannot pool
> connections between databases. So if you have 2,000 databases, you are
> going to need a minimum of 2,000 connections to service those databases
> (assuming you want to keep at least one active connection open per client
> at a time).

That isn't exactly true. You could run multiple poolers.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
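For example, a pooler such as PgBouncer keeps a separate pool per database/user pair, but those pools can be small and idle server connections reaped, so 2,000 databases need not mean 2,000 permanently open server connections. A hypothetical fragment:

```ini
; pgbouncer.ini sketch (hypothetical values)
[databases]
; one entry per client db, or a wildcard fallback for all of them:
* = host=127.0.0.1 port=5432

[pgbouncer]
pool_mode = transaction      ; release server connection after each txn
default_pool_size = 2        ; per database/user pair
server_idle_timeout = 60     ; close server connections idle this long (s)
```

Running several such poolers (e.g. one per application server or per shard) spreads the remaining per-database overhead further.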
On 12/05/2013 02:42 AM, Max wrote:
> We are starting a new project to deploy a solution in the cloud with the
> possibility of being used by 2,000+ clients. Each of these clients will
> use several tables to store their information (our model has about 500+
> tables, but there are fewer than 100 core tables with heavy use). Also,
> the projected amount of information per client could range from small (a
> few hundred tuples/MB) to huge (a few million tuples/GB).
>
> One of the many questions we have is about performance of the db if we
> work with only one (using a ClientID to separate each client's info) or
> thousands of separate dbs. The management of the dbs is not a huge
> concern as we have an automated tool.

In addition to the excellent advice from others, I'll speak from experience:

The best model here, if you can implement it, is to implement shared tables for all customers, but have a way you can "break out" customers to their own database(s). This allows you to start with a single database, but to shard out your larger customers as they grow. The small customers will always stay on the same DB.

That means you'll also treat the different customers as different DB connections from day 1. That way, when you move the large customers out to separate servers, you don't have to change the way the app connects to the database.

If you can't implement shared tables, I'm going to say go for separate databases. This will mean lots of additional storage space -- the per-DB overhead by itself will be 100GB -- but otherwise you'll be grappling with the issues involved in having a million tables, which Joe Conway outlined. But if you don't have shared tables, your huge schema is always going to cause you to waste resources on the smaller customers.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
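The "different DB connections from day 1" idea can be sketched as a small routing layer that maps each client to a connection string; small clients share the default database until they are broken out, and moving a big client becomes a routing change rather than an application change. All names below are hypothetical:

```python
# Hypothetical client-to-database routing layer. The application always
# asks this map for its connection string, so "breaking out" a large
# client to its own server only changes the map, not the app code.

DEFAULT_DSN = "host=db-shared dbname=app"

# Clients already broken out to dedicated databases/servers:
OVERRIDES = {
    4711: "host=db-big-1 dbname=client_4711",
}

def dsn_for(client_id: int) -> str:
    """Return the connection string to use for the given client."""
    return OVERRIDES.get(client_id, DEFAULT_DSN)

if __name__ == "__main__":
    print(dsn_for(1))      # small client -> shared database
    print(dsn_for(4711))   # large client -> dedicated database
```

In practice the map would live in a small catalog table or config service rather than in code, but the shape of the indirection is the same.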
2013/12/6 Josh Berkus <josh@agliodbs.com>
+1
Pavel
On Thu, Dec 05, 2013 at 02:42:10AM -0800, Max wrote:
> One of the many questions we have is about performance of the db if we
> work with only one (using a ClientID to separate each client's info) or
> thousands of separate dbs. The management of the dbs is not a huge
> concern as we have an automated tool.

We have made very good experiences with putting each client into its own database. We have a few thousand dbs now on 5 machines (each with 1TB capacity), where each client/db holds between 100MB and 100GB of data.

As Josh said, you have to consider the per-db overhead. If you have only a few MBs of data per client, it might not be worth it. (An empty DB shows up with 6MB size in psql \l+.)

The good thing with a db per client is that you can easily scale horizontally by just adding machines. We have between 100 and 1000 dbs per machine, depending on client size. There's no real limit on growth regarding client numbers; we can always add more machines. We can also easily move clients between machines with pg_dump piped into pg_restore.

I would not advise using one schema per client, because then you lose the ability to really use schemas within each client's 'namespace'. AFAIK schemas cannot be nested in Postgres.
Schemas are very helpful to separate different applications or to implement versioning for complex views/functions, so let's not waste them on partitioning.

Further things we learned:

- "CREATE DATABASE foo TEMPLATE bar" is a nice way to cleanly create a new partition/client based on a template database.
- On a very busy server (I/O-wise), CREATE DATABASE can take a while to complete, due to the enforced CHECKPOINT when creating a new DB. We worked around this by creating empty dbs from the template beforehand, allocating (renaming) them on demand, and periodically restocking those spare dbs.
- pg_dump/pg_restore on individual client dbs is a neat way to implement backup/restore. It allows you to back up all clients sequentially as well as concurrently (especially from multiple machines), depending on your requirements.
- When partitioning into databases, it's not trivial to reference data in other databases. E.g. you can't have foreign keys to your main db (where you track all clients and their dbs). This could probably be worked around with dblink / foreign data wrappers if necessary.
- We just completed a painless migration from 9.0 to 9.3 simply by installing 9.3 next to 9.0 on all machines and selectively migrating individual client dbs with pg_dump | pg_restore over a period of 6 weeks. (We did not notice any problems, btw.)
- Queries over the data of all clients (e.g. for internal monitoring or statistics) naturally take a while, as you'll have to connect to all the individual dbs and then manually aggregate the results from each one.
- Schema changes are not trivial, as you need to develop tools to apply them to all client dbs and template dbs (in our case). It gets tricky when the process is interrupted, and there are race conditions when new dbs are created mid-process that you need to protect against.
- Deleting all the data of an individual client is as simple as dropping the db.

Hope that helps.

Oliver
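The template-database trick Oliver describes looks roughly like this (database names are hypothetical; CREATE DATABASE ... TEMPLATE is standard PostgreSQL):

```sql
-- Prepare a template holding the full (empty) client schema once:
CREATE DATABASE client_template;
-- ... create the ~500 tables inside client_template ...

-- Cloning the template forces a CHECKPOINT, so on busy servers
-- pre-create spares and just rename one when a client signs up:
CREATE DATABASE spare_001 TEMPLATE client_template;
ALTER DATABASE spare_001 RENAME TO client_4711;

-- Deleting all of a client's data later is one statement:
DROP DATABASE client_4711;
```

Note that CREATE DATABASE ... TEMPLATE requires that no one else is connected to the template database while it is being copied.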