Thread: One huge db vs many small dbs

One huge db vs many small dbs

From
Max
Date:
Hello,

We are starting a new project to deploy a solution in the cloud that could be used by 2,000+ clients. Each of these clients will use several tables to store their information (our model has about 500+ tables, but there are fewer than 100 core tables with heavy use). Also, the projected amount of information per client could range from small (a few hundred tuples/MB) to huge (a few million tuples/GB).

One of the many questions we have is about the performance of the db if we work with only one (using a ClientID to separate each client's info) or thousands of separate dbs. The management of the dbs is not a huge concern, as we have an automated tool.

Googling turns up lots of discussion of this subject, but none of it matches the scenario I presented above, so I would like to know if anyone here has a similar situation or relevant knowledge and could share some thoughts.

Thanks

Max

Re: One huge db vs many small dbs

From
bricklen
Date:

On Thu, Dec 5, 2013 at 2:42 AM, Max <maxabbr@yahoo.com.br> wrote:
We are starting a new project to deploy a solution in the cloud that could be used by 2,000+ clients. Each of these clients will use several tables to store their information (our model has about 500+ tables, but there are fewer than 100 core tables with heavy use). Also, the projected amount of information per client could range from small (a few hundred tuples/MB) to huge (a few million tuples/GB).

One of the many questions we have is about the performance of the db if we work with only one (using a ClientID to separate each client's info) or thousands of separate dbs. The management of the dbs is not a huge concern, as we have an automated tool.

More details would be helpful, some of which could include:
how much memory is dedicated to Postgresql,
how many servers,
are you using replication/hot standby,
what are your data access patterns like (mostly inserts / lots of concurrent queries; a handful of users versus hundreds querying at the same time),
what are your plans for backups,
what are you planning to do to archive older data?
Also, have you considered separate schemas rather than separate databases?
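For reference, a schema-per-client layout inside a single database might be sketched like this (schema and table names are illustrative, not from the thread):

```sql
-- One database, one schema per client; each client gets its own copy
-- of the tables inside its schema.
CREATE SCHEMA client_0001;
CREATE TABLE client_0001.orders (
    order_id   bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- At connection time, point the session at one client's schema so
-- unqualified table names resolve there:
SET search_path TO client_0001, public;
SELECT count(*) FROM orders;  -- resolves to client_0001.orders
```

This keeps everything in one database (one backup, one connection pool) while still giving each client physically separate tables.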

Re: One huge db vs many small dbs

From
Joe Conway
Date:

On 12/05/2013 02:42 AM, Max wrote:
> Hello,
>
> We are starting a new project to deploy a solution in the cloud
> that could be used by 2,000+ clients. Each of these clients
> will use several tables to store their information (our model has
> about 500+ tables, but there are fewer than 100 core tables with
> heavy use). Also, the projected amount of information per client
> could range from small (a few hundred tuples/MB) to huge (a few
> million tuples/GB).
>
> One of the many questions we have is about the performance of the db
> if we work with only one (using a ClientID to separate each client's
> info) or thousands of separate dbs. The management of the dbs is
> not a huge concern, as we have an automated tool.

If I understand correctly: 500 tables x 2000 = 1 million tables

Even if not heavily used, in my experience 1 million tables in a
single database will cause problems for you:
1) on Postgres versions < 9.3, pg_dump takes a *long* time (think days)
2) psql tab completion is really slow
3) probably others I'm not thinking of right now...

There are advantages to not needing to manage so many databases, but I
would test it carefully before committing.

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support


Re: One huge db vs many small dbs

From
David Johnston
Date:
maxabbr wrote
> Hello,
>
> We are starting a new project to deploy a solution in the cloud that
> could be used by 2,000+ clients. Each of these clients will use
> several tables to store their information (our model has about 500+ tables,
> but there are fewer than 100 core tables with heavy use). Also, the projected
> amount of information per client could range from small (a few hundred
> tuples/MB) to huge (a few million tuples/GB).
>
> One of the many questions we have is about the performance of the db if we
> work with only one (using a ClientID to separate each client's info) or
> thousands of separate dbs. The management of the dbs is not a huge concern,
> as we have an automated tool.
>
> Googling turns up lots of discussion of this subject, but none of it
> matches the scenario I presented above, so I would like to know if
> anyone here has a similar situation or relevant knowledge and could
> share some thoughts.
>
>
> Thanks
>
> Max

My untested thought here is a hybrid approach.  Allow any one database to
contain any number of clients on a common schema with a controlling clientId
column, but allow for multiple databases.  Furthermore, any non-client
shared data could live in a separate reference database, with the
possibility of caching said data in each of the client databases where
applicable.

Though until your needs dictate that level of complexity, you can have just
one database and schema set for all clients.

While row-level security would make this more tenable, generally this model
works best if all client access goes through middleware.  You can mitigate
the risk by using separate databases for any clients with a higher risk
profile (i.e., larger datasets, direct access to the DB, etc.).

Adding clientId overhead will degrade performance somewhat but increase
your flexibility considerably.  That is often a worthwhile trade-off to make
even if you decide to create separate schemas/databases.
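The shared-table-plus-clientId model sketched above might look like this in SQL. Note that row-level security had not yet been released when this thread was written (it shipped in PostgreSQL 9.5); the table, column, and setting names below are all illustrative:

```sql
-- Shared-table layout with a controlling client_id column: every
-- tenant's rows live in the same table, distinguished by client_id.
CREATE TABLE orders (
    client_id  integer NOT NULL,
    order_id   bigserial,
    total      numeric(12,2),
    PRIMARY KEY (client_id, order_id)
);

-- With 9.5+ row-level security, clients can be fenced off from each
-- other even without middleware. 'app.client_id' is a made-up
-- session setting the application would populate.
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY orders_by_client ON orders
    USING (client_id = current_setting('app.client_id')::integer);

-- The middleware sets the client for the session before querying:
SET app.client_id = '42';
```

Without RLS, the same fencing has to be enforced by appending `WHERE client_id = ...` to every query in the middleware, which is exactly why direct DB access raises the risk profile.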

David J.





--
View this message in context:
http://postgresql.1045698.n5.nabble.com/One-huge-db-vs-many-small-dbs-tp5781827p5781924.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: One huge db vs many small dbs

From
"Nicholson, Brad (Toronto, ON, CA)"
Date:
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-
> performance-owner@postgresql.org] On Behalf Of Max
> Sent: Thursday, December 05, 2013 5:42 AM
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] One huge db vs many small dbs
>
> Hello,
>
>
> We are starting a new project to deploy a solution in the cloud that could be used
> by 2,000+ clients. Each of these clients will use several tables to
> store their information (our model has about 500+ tables, but there are fewer
> than 100 core tables with heavy use). Also, the projected amount of
> information per client could range from small (a few hundred tuples/MB) to
> huge (a few million tuples/GB).
>
>
> One of the many questions we have is about the performance of the db if we
> work with only one (using a ClientID to separate each client's info) or thousands
> of separate dbs. The management of the dbs is not a huge concern, as we
> have an automated tool.

If you are planning on using persistent connections, the many-databases approach has a significant
disadvantage: you cannot pool connections across databases.  So if you have 2,000 databases, you are going to need a
minimum of 2,000 connections to service those databases (assuming you want to keep at least one active connection open per
client at a time).

Brad.



Re: One huge db vs many small dbs

From
"Joshua D. Drake"
Date:

>> One of the many questions we have is about the performance of the db if we
>> work with only one (using a ClientID to separate each client's info) or thousands
>> of separate dbs. The management of the dbs is not a huge concern, as we
>> have an automated tool.
>
> If you are planning on using persistent connections, the many-databases approach has a significant
disadvantage: you cannot pool connections across databases.  So if you have 2,000 databases, you are going to need a
minimum of 2,000 connections to service those databases (assuming you want to keep at least one active connection open per
client at a time).

That isn't exactly true. You could run multiple poolers.

JD

>
> Brad.
>
>
>


--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
    a rose in the deeps of my heart. - W.B. Yeats


Re: One huge db vs many small dbs

From
Josh Berkus
Date:
On 12/05/2013 02:42 AM, Max wrote:
> Hello,
>
> We are starting a new project to deploy a solution in the cloud that could be used by 2,000+ clients. Each
of these clients will use several tables to store their information (our model has about 500+ tables, but there are fewer
than 100 core tables with heavy use). Also, the projected amount of information per client could range from small (a few
hundred tuples/MB) to huge (a few million tuples/GB).
>
> One of the many questions we have is about the performance of the db if we work with only one (using a ClientID to
separate each client's info) or thousands of separate dbs. The management of the dbs is not a huge concern, as we have an
automated tool.

In addition to the excellent advice from others, I'll speak from experience:

The best model here, if you can implement it, is to implement shared
tables for all customers, but have a way you can "break out" customers
to their own database(s).  This allows you to start with a single
database, but to shard out your larger customers as they grow.  The
small customers will always stay on the same DB.

That means you'll also treat the different customers as different DB
connections from day 1.  That way, when you move the large customers out
to separate servers, you don't have to change the way the app connects
to the database.
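One way to realize "different DB connections from day 1" is a routing table the application consults at login, so breaking a customer out later is a data change rather than a code change. A minimal sketch (all names and defaults are illustrative):

```sql
-- Routing table in a central metadata database: maps each client to
-- the server and database that currently hold its data.
CREATE TABLE client_routing (
    client_id integer PRIMARY KEY,
    db_host   text NOT NULL DEFAULT 'db1.internal',
    db_name   text NOT NULL DEFAULT 'shared_main'
);

-- Sharding out a large client later is a one-row update; the app
-- picks up the new connection target on its next login.
UPDATE client_routing
   SET db_host = 'db7.internal', db_name = 'client_42'
 WHERE client_id = 42;
```

The application builds its connection string from this lookup instead of hard-coding a single database, which is what lets large customers move to their own servers transparently.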

If you can't implement shared tables, I'm going to say go for separate
databases.  This will mean lots of additional storage space -- the
per-DB overhead by itself will be 100GB -- but otherwise you'll be
grappling with the issues involved in having a million tables, which Joe
Conway outlined.  But if you don't have shared tables, your huge schema
is always going to cause you to waste resources on the smaller customers.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: One huge db vs many small dbs

From
Pavel Stehule
Date:



2013/12/6 Josh Berkus <josh@agliodbs.com>
On 12/05/2013 02:42 AM, Max wrote:
> Hello,
>
> We are starting a new project to deploy a solution in the cloud that could be used by 2,000+ clients. Each of these clients will use several tables to store their information (our model has about 500+ tables, but there are fewer than 100 core tables with heavy use). Also, the projected amount of information per client could range from small (a few hundred tuples/MB) to huge (a few million tuples/GB).
>
> One of the many questions we have is about the performance of the db if we work with only one (using a ClientID to separate each client's info) or thousands of separate dbs. The management of the dbs is not a huge concern, as we have an automated tool.

In addition to the excellent advice from others, I'll speak from experience:

The best model here, if you can implement it, is to implement shared
tables for all customers, but have a way you can "break out" customers
to their own database(s).  This allows you to start with a single
database, but to shard out your larger customers as they grow.  The
small customers will always stay on the same DB.

That means you'll also treat the different customers as different DB
connections from day 1.  That way, when you move the large customers out
to separate servers, you don't have to change the way the app connects
to the database.

If you can't implement shared tables, I'm going to say go for separate
databases.  This will mean lots of additional storage space -- the
per-DB overhead by itself will be 100GB -- but otherwise you'll be
grappling with the issues involved in having a million tables, which Joe
Conway outlined.  But if you don't have shared tables, your huge schema
is always going to cause you to waste resources on the smaller customers.


+1

Pavel
 

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: One huge db vs many small dbs

From
Oliver Seemann
Date:
On Thu, Dec 05, 2013 at 02:42:10AM -0800, Max wrote:
> Hello,
>
> We are starting a new project to deploy a solution in the cloud that
> could be used by 2,000+ clients. Each of these clients will use
> several tables to store their information (our model has about 500+
> tables, but there are fewer than 100 core tables with heavy use). Also, the
> projected amount of information per client could range from small (a few
> hundred tuples/MB) to huge (a few million tuples/GB).
>
> One of the many questions we have is about the performance of the db if we
> work with only one (using a ClientID to separate each client's info) or
> thousands of separate dbs. The management of the dbs is not a huge
> concern, as we have an automated tool.
>
> Googling turns up lots of discussion of this subject, but none of it
> matches the scenario I presented above, so I would like to know if anyone
> here has a similar situation or relevant knowledge and could share some
> thoughts.


We have had very good experiences with putting each client into its own
database. We have a few thousand dbs now on 5 machines (each with 1TB capacity),
where each client/db is between 100MB and 100GB of data.
As Josh said, you have to consider the per-db overhead. If you have only a few
MBs of data per client, it might not be worth it. (An empty DB shows up with a
size of 6MB in psql's \l+.)

The good thing with a db per client is that you can easily scale horizontally
by just adding machines. We have between 100 and 1,000 dbs per machine,
depending on client size. There's no real limit on growth in client
numbers; we can always add more machines. We can also easily move
clients between machines with pg_dump piped into pg_restore.

I would not advise using one schema per client, because then you lose the
ability to really use schemas within each client 'namespace'. AFAIK, schemas
cannot be nested in Postgres. Schemas are very helpful for separating
different applications or implementing versioning for complex
views/functions, so let's not waste them on partitioning.

Further things we learned:

- "CREATE DATABASE foo TEMPLATE bar" is a nice way to cleanly create a new
partition/client based on a template database.

- On a very busy server (I/O-wise), CREATE DATABASE can take a while to
complete, due to the enforced CHECKPOINT when creating a new DB. We worked
around this by creating empty dbs from the template beforehand, allocating
(renaming) them on demand, and periodically restocking the pool of spare dbs.
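The spare-database trick described above might look like this (database names are illustrative; note that CREATE DATABASE cannot run inside a transaction block):

```sql
-- Provision spare databases ahead of time, absorbing the checkpoint
-- cost that CREATE DATABASE forces on a busy server:
CREATE DATABASE spare_001 TEMPLATE client_template;
CREATE DATABASE spare_002 TEMPLATE client_template;

-- Later, when a new client signs up, hand out a spare by renaming it.
-- This is a fast catalog operation with no checkpoint:
ALTER DATABASE spare_001 RENAME TO client_2041;
```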

- pg_dump/pg_restore on individual client dbs is a neat way to implement
backup/restore. It allows you to backup all clients sequentially as well as
concurrently (especially from multiple machines) depending on your
requirements.

- When partitioning into databases it's not trivial to reference data in
other databases. E.g. you can't have foreign keys to your main db (where
you track all clients and their dbs). This could probably be worked around
with dblink / foreign data wrappers if necessary.
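A hedged sketch of the foreign-data-wrapper workaround, using postgres_fdw (available since 9.3; dblink is the older alternative). All server, table, and credential names are placeholders, and this still does not give you real cross-database foreign keys:

```sql
-- Run inside a client database to expose the main db's clients table.
CREATE EXTENSION postgres_fdw;

CREATE SERVER main_db FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'db-main.internal', dbname 'main');

CREATE USER MAPPING FOR CURRENT_USER SERVER main_db
    OPTIONS (user 'app', password 'secret');

-- A local, read-through view of the remote table:
CREATE FOREIGN TABLE clients (
    client_id integer,
    name      text
) SERVER main_db OPTIONS (schema_name 'public', table_name 'clients');

SELECT name FROM clients WHERE client_id = 42;
```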

- We just completed a painless migration from 9.0 to 9.3 simply by installing
9.3 next to 9.0 on all machines and selectively migrating individual client
dbs with pg_dump | pg_restore over a period of 6 weeks. (We did not notice
any problems, btw.)

- Queries over the data of all clients (e.g. for internal monitoring or
statistics) naturally take a while as you'll have to connect to all
individual dbs and then manually aggregate the result from each one.

- Schema changes are not trivial, as you need to develop tools to apply them
to all client dbs and template dbs (in our case). It gets tricky when the
process is interrupted, and there are race conditions to protect against when
new dbs are created while the process is running.

- Deleting all the data of an individual client is as simple as dropping the db.


Hope that helps.

Oliver