Thread: Databases vs Schemas

Databases vs Schemas

From: Scott Otis

I am seeking advice on what the best setup for the following would be.

My company provides a hosted web calendaring solution for school districts.  For each school district we have a separate database.  Each database has 57 tables.  There are a total of 649 fields in those tables.  Here is a table of the different kinds of fields and how many there are:

time without time zone           9
bytea                            4
date                             8
smallint                         1
boolean                         79
integer                        195
timestamp without time zone     36
numeric                          8
text                           309

Over the next couple of months we will be creating an instance of our solution for each public school district in the US, of which there are around 18,000.  That means currently we would be creating 18,000 databases (all on one server right now, which is running 8.4).  I am assuming this is probably not the best way of doing things.

I have read up on schemas, and it looks like a good change to make would be to create 1 database with 18,000 schemas.

Would that be a good idea?  What sort of issues should I be aware of (administrative, management, performance, etc.)?  Is that too many schemas to put into 1 database?  What are the limits on the number of databases and schemas you can create?

Should I try to re-engineer things so that all 18,000 instances use only 1 database and 1 schema?
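For illustration, that last option would mean sharing tables and keying every row by district; a rough sketch (the table and column names are hypothetical):

    -- 1 database, 1 schema: every table carries a district key
    CREATE TABLE events (
        district_id  integer NOT NULL,
        event_id     serial,
        title        text,
        PRIMARY KEY (district_id, event_id)
    );

    -- every query then has to filter on the district
    SELECT title FROM events WHERE district_id = 12345;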


Let me know if you need any more info.

Any advice and information would be greatly appreciated.

Regards,

Scott Otis
CIO / Lead Developer
Intand
www.intand.com

Re: Databases vs Schemas

From: Ben Chobot
Scott Otis wrote:
>
> I am seeking advice on what the best setup for the following would be.
>
>
>
> My company provides a hosted web calendaring solution for school
> districts.  For each school district we have a separate database.
> Each database has 57 tables.
>
....
>
> Over the next couple of months we will be creating an instance of our
> solution for each public school district in the US, of which there are
> around 18,000.
>


Why are you trying to keep all this information on one server? Since you
have such perfectly independent silos of data, why not take the
opportunity to scale out horizontally? It's usually a lot cheaper to buy
4 machines of power x than one machine of power 4x.

Re: Databases vs Schemas

From: Merlin Moncure
On Fri, Oct 9, 2009 at 1:46 PM, Scott Otis <scott.otis@intand.com> wrote:
> Over the next couple of months we will be creating an instance of our solution for each public school district in the
> US, of which there are around 18,000.  That means currently we would be creating 18,000 databases (all on one server
> right now, which is running 8.4).  I am assuming this is probably not the best way of doing things.

Schema advantages:
*) maintenance advantages; all functions/trigger functions can be
shared.  HUGE help if you use them
*) can query shared data between schemas without major headaches
*) a bit more efficiency especially if private data areas are small.
kinda analogous to processes vs threads
*) Can manage the complete system without changing database sessions.
This is the kicker IMO.

Database Advantages:
*) More discrete.  Easier to distinctly create, dump, drop, or move to
separate server
*) Smaller system catalogs might give efficiency benefits
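To make the schema approach concrete, a minimal sketch (the schema and
table names are made up):

    -- one database; each district gets its own schema with identical tables
    CREATE SCHEMA district_12345;
    CREATE TABLE district_12345.events (
        event_id   serial PRIMARY KEY,
        title      text NOT NULL,
        starts_at  timestamp without time zone
    );

    -- per session, point at one district; shared objects still resolve via public
    SET search_path = district_12345, public;
    SELECT * FROM events;  -- reads district_12345.events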

merlin

Re: Databases vs Schemas

From: Scott Carey


On 10/9/09 2:02 PM, "Merlin Moncure" <mmoncure@gmail.com> wrote:

> On Fri, Oct 9, 2009 at 1:46 PM, Scott Otis <scott.otis@intand.com> wrote:
>> Over the next couple of months we will be creating an instance of our
>> solution for each public school district in the US, of which there are
>> around 18,000.  That means currently we would be creating 18,000 databases
>> (all on one server right now, which is running 8.4).  I am assuming this is
>> probably not the best way of doing things.
>
> Schema advantages:
> *) maintenance advantages; all functions/trigger functions can be
> shared.  HUGE help if you use them
> *) can query shared data between schemas without major headaches
> *) a bit more efficiency especially if private data areas are small.
> kinda analogous to processes vs threads
> *) Can manage the complete system without changing database sessions.
> This is the kicker IMO.
>
> Database Advantages:
> *) More discrete.  Easier to distinctly create, dump, drop, or move to
> separate server
> *) Smaller system catalogs might give efficiency benefits
>

I'm concerned how a system with 57 * 18000 > 1M tables will function.

I've got 200,000 tables in one db (8.4), and some tools barely work.  The
system catalogs get inefficient when large, and psql especially has trouble.
Tab completion takes forever, even if I make a schema "s" with one table in
it and type "s." and try to tab complete -- it's as if it's scanning everything
without a schema qualifier or using an index.  Sometimes it does not match
valid tables at all, and sometimes regex matching fails too ('\dt
schema.*_*_*' intermittently flakes out if it returns a lot of matches).
Other than that, the number of tables doesn't seem to cause much performance
trouble.  The only exception is constraint exclusion, which is fundamentally
broken with too many tables on the performance and memory consumption side.

Having a lot of tables really makes me wish VACUUM, ANALYZE, and other
maintenance tools could partially match object names with regex, though.

On the other hand, lots of databases probably have performance drawbacks too,
and the maintenance drawbacks are even bigger.

I certainly don't see any reason at all to try to put all of these in one
schema.  The only useful choices are schemas vs databases.  I'd go for
schemas unless the performance issues there are a problem.  Schemas can be
dumped/restored/backed up independently of one another easily too.
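For example, pulling one district out is a single command either way; a
sketch (the database and schema names are made up):

    # dump just that district's schema from the shared database
    pg_dump -n district_12345 calendars > district_12345.sql

    # restore it into another database or server
    psql -d calendars2 -f district_12345.sql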



Re: Databases vs Schemas

From: Tom Lane
Scott Carey <scott@richrelevance.com> writes:
> I've got 200,000 tables in one db (8.4), and some tools barely work.  The
> system catalogs get inefficient when large, and psql especially has trouble.
> Tab completion takes forever, even if I make a schema "s" with one table in
> it and type "s." and try to tab complete -- it's as if it's scanning everything
> without a schema qualifier or using an index.

The tab-completion queries have never been vetted for performance
particularly :-(

Just out of curiosity, how much does this help?

alter function pg_table_is_visible(oid) cost 10;

(You'll need to do it as superuser --- if it makes things worse, just
set the cost back to 1.)

> Sometimes it does not match
> valid tables at all, and sometimes regex matching fails too ('\dt
> schema.*_*_*' intermittently flakes out if it returns a lot of matches).

There are some arbitrary "LIMIT 1000" clauses in those queries, which
probably explains this ... but taking them out would likely cause
libreadline to get indigestion ...

            regards, tom lane

Re: Databases vs Schemas

From: Merlin Moncure
On Fri, Oct 9, 2009 at 10:50 PM, Scott Carey <scott@richrelevance.com> wrote:
> On 10/9/09 2:02 PM, "Merlin Moncure" <mmoncure@gmail.com> wrote:
>
>> On Fri, Oct 9, 2009 at 1:46 PM, Scott Otis <scott.otis@intand.com> wrote:
>>> Over the next couple of months we will be creating an instance of our
>>> solution for each public school district in the US, of which there are
>>> around 18,000.  That means currently we would be creating 18,000 databases
>>> (all on one server right now, which is running 8.4).  I am assuming this is
>>> probably not the best way of doing things.
>>
>> Schema advantages:
>> *) maintenance advantages; all functions/trigger functions can be
>> shared.  HUGE help if you use them
>> *) can query shared data between schemas without major headaches
>> *) a bit more efficiency especially if private data areas are small.
>> kinda analogous to processes vs threads
>> *) Can manage the complete system without changing database sessions.
>> This is the kicker IMO.
>>
>> Database Advantages:
>> *) More discrete.  Easier to distinctly create, dump, drop, or move to
>> separate server
>> *) Smaller system catalogs might give efficiency benefits
>>
>
> I'm concerned how a system with 57 * 18000 > 1M tables will function.
>
> I've got 200,000 tables in one db (8.4), and some tools barely work.  The
> system catalogs get inefficient when large, and psql especially has trouble.
> Tab completion takes forever, even if I make a schema "s" with one table in
> it and type "s." and try to tab complete -- it's as if it's scanning everything
> without a schema qualifier or using an index.  Sometimes it does not match
> valid tables at all, and sometimes regex matching fails too ('\dt
> schema.*_*_*' intermittently flakes out if it returns a lot of matches).
> Other than that, the number of tables doesn't seem to cause much performance
> trouble.  The only exception is constraint exclusion, which is fundamentally
> broken with too many tables on the performance and memory consumption side.
>
> Having a lot of tables really makes me wish VACUUM, ANALYZE, and other
> maintenance tools could partially match object names with regex, though.
>
> On the other hand, lots of databases probably have performance drawbacks too,
> and the maintenance drawbacks are even bigger.
>
> I certainly don't see any reason at all to try to put all of these in one
> schema.  The only useful choices are schemas vs databases.  I'd go for
> schemas unless the performance issues there are a problem.  Schemas can be
> dumped/restored/backed up independently of one another easily too.

They can, but: drop schema foo cascade; is a different operation than:
drop database foo;  The first is kinda surgical and the second is a
rocket launcher.  What would you rather have in battle?

For the record, just about every database I've ever designed has had
some of what I call 'de facto table partitioning' using
schemas/search_path tricks.  I'm working on a system right now that is
going to get very large, and if I started to run into psql problems I'd
probably look at patching it, maybe \set an option to simplify some
of the queries.
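That trick looks roughly like this (names invented): identical tables in
per-tenant schemas, with search_path deciding which set an unqualified
query hits.

    -- identical table layout in each tenant schema
    CREATE SCHEMA tenant_a;
    CREATE TABLE tenant_a.events (event_id int, title text);
    CREATE SCHEMA tenant_b;
    CREATE TABLE tenant_b.events (event_id int, title text);

    -- the same SQL serves any tenant, depending on the search path
    SET search_path = tenant_a, public;
    INSERT INTO events VALUES (1, 'homecoming');  -- goes to tenant_a.events
    SET search_path = tenant_b, public;
    SELECT * FROM events;                         -- reads tenant_b.events (empty)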

merlin

Re: Databases vs Schemas

From: Chris Kratz
On Fri, Oct 9, 2009 at 11:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Scott Carey <scott@richrelevance.com> writes:
>> I've got 200,000 tables in one db (8.4), and some tools barely work.  The
>> system catalogs get inefficient when large, and psql especially has trouble.
>> Tab completion takes forever, even if I make a schema "s" with one table in
>> it and type "s." and try to tab complete -- it's as if it's scanning everything
>> without a schema qualifier or using an index.
>
> The tab-completion queries have never been vetted for performance
> particularly :-(
>
> Just out of curiosity, how much does this help?
>
> alter function pg_table_is_visible(oid) cost 10;
>
> (You'll need to do it as superuser --- if it makes things worse, just
> set the cost back to 1.)
>
>> Sometimes it does not match
>> valid tables at all, and sometimes regex matching fails too ('\dt
>> schema.*_*_*' intermittently flakes out if it returns a lot of matches).
>
> There are some arbitrary "LIMIT 1000" clauses in those queries, which
> probably explains this ... but taking them out would likely cause
> libreadline to get indigestion ...
>
>                        regards, tom lane
We ran into this exact situation with a pg 8.3 database and a very large number of tables.  psql would wait for 20 to 30 seconds if the user was unlucky enough to hit the tab key.  After doing some research with query logging, explain analyze and some trial and error, we came to the same conclusion.  Altering the cost for the pg_table_is_visible function to 10 fixed our performance problem immediately.  It appears that when the cost was set to 1, the query optimizer first ran the function over the entire pg_class table.  By increasing the cost, it now only runs the function over the rows returned by the other items in the where clause.
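For anyone who wants to try the same tweak, a sketch (run it as superuser,
and set the cost back if it makes things worse):

    -- check the function's current planner cost estimate
    SELECT proname, procost FROM pg_proc WHERE proname = 'pg_table_is_visible';

    -- make the visibility check look expensive so the planner runs it last
    ALTER FUNCTION pg_table_is_visible(oid) COST 10;

    -- revert
    ALTER FUNCTION pg_table_is_visible(oid) COST 1;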

-chris

Re: Databases vs Schemas

From: Scott Marlowe
On Sat, Oct 10, 2009 at 8:44 AM, Chris Kratz <chris.kratz@vistashare.com> wrote:
>>
>> alter function pg_table_is_visible(oid) cost 10;
>>
>> (You'll need to do it as superuser --- if it makes things worse, just
>> set the cost back to 1.)
>>
>> > Sometimes it does not match
>> > valid tables at all, and sometimes regex matching fails too ('\dt
>> > schema.*_*_*' intermittently flakes out if it returns a lot of matches).
>>
>> There are some arbitrary "LIMIT 1000" clauses in those queries, which
>> probably explains this ... but taking them out would likely cause
>> libreadline to get indigestion ...
>>
>>                        regards, tom lane
>
>
> We ran into this exact situation with a pg 8.3 database and a very large
> number of tables.  psql would wait for 20 to 30 seconds if the user was
> unlucky enough to hit the tab key.  After doing some research with query
> logging, explain analyze and some trial and error, we came to the same
> conclusion.  Altering the cost for the pg_table_is_visible function to 10
> fixed our performance problem immediately.  It appears that when the cost
> was set to 1, the query optimizer first ran the function over the
> entire pg_class table.  By increasing the cost, it now only runs the
> function over the rows returned by the other items in the where clause.

We have a large number of objects in our db and this worked for me
too!  Thanks a lot.  As a side note, having that many objects also makes
slony's create set stuff run really, really slowly, and I'm guessing
there's a similar trick I can apply to the slony functions to see if it
helps.