Thread: using schemas for data separation
I'm re-evaluating a few design choices I made a while back, and one that keeps coming to the forefront is data separation. We store sensitive information for clients. A database for each client isn't really workable, or at least I've never thought of a way to make it workable, as we have several thousand clients and the databases all have to be accessed through a limited number of web applications where performance is important and things like persistent connections are a must. I've always been paranoid about a programmer error in an application resulting in data from multiple clients getting mixed together.

Right now we create a schema for each client, with each schema having the same tables. The connections to the database are from an unprivileged user, and everything goes through functions that run at the necessary privileges. We use SET search_path TO public, user: the user's data is in their own schema and the functions are in the public schema. Every table has a client_id column.

This has worked well so far, but it's a real pain to manage and as we ramp up I'm not sure it's going to scale that well. So anyway, my question is this: am I being too paranoid about putting all the data into one set of tables in a common schema? For thousands of clients, what would you do?

Chris
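For readers skimming the thread, a minimal, untested sketch of the layout Chris describes (all object names invented):

  -- one schema per client, each containing the same tables
  CREATE SCHEMA client_1001;
  CREATE TABLE client_1001.orders (
      order_id  serial PRIMARY KEY,
      client_id integer NOT NULL,
      amount    numeric(12,2) NOT NULL
  );

  -- data access goes through functions in public, run with elevated rights
  CREATE FUNCTION public.add_order(integer, numeric) RETURNS void AS $$
      INSERT INTO orders (client_id, amount) VALUES ($1, $2);
  $$ LANGUAGE sql SECURITY DEFINER;

  -- each application session points the search path at its client's schema
  SET search_path TO public, client_1001;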
snacktime wrote:
> This has worked well so far, but it's a real pain to manage and as we
> ramp up I'm not sure it's going to scale that well. So anyway, my
> question is this: am I being too paranoid about putting all the data
> into one set of tables in a common schema? For thousands of clients,
> what would you do?

Hard to say what too paranoid really is, so let me ask you: what do you think won't scale?
On 29/9/2006 15:29, "snacktime" <snacktime@gmail.com> wrote:

> This has worked well so far, but it's a real pain to manage and as we
> ramp up I'm not sure it's going to scale that well. So anyway, my
> question is this: am I being too paranoid about putting all the data
> into one set of tables in a common schema? For thousands of clients,
> what would you do?

I would think of having a client table with their id as a permanent part of the where clause, so that you can't work without specifying which client you are working for at the time.

Not sure if a trigger would be able to ensure you can't add, update or delete unless the clientID is included (pretty sure you don't get to see the SQL, only the results). You may need to write a plugin or custom mod to get 100% certainty that a statement can't be run without the clientID included in the search. It would have to be an easier solution than continually updating thousands of schemas to keep them in sync.

Thinking about it - it should only be a small source change to the sql parser to stop it from running a statement that didn't include clientID in the where clause. A small change that is easy to add again to new versions as they are released.

--
Shane Ambler
Postgres@007Marketing.com

Get Sheeky @ http://Sheeky.Biz
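A row-level trigger can at least enforce the match on writes, even though it never sees the WHERE clause. A rough, untested sketch - the app.client_id setting is invented, and on the 8.x releases custom settings need custom_variable_classes in postgresql.conf:

  CREATE FUNCTION check_client() RETURNS trigger AS $$
  BEGIN
      -- reject any row that isn't tagged with the session's client
      IF NEW.client_id <> current_setting('app.client_id')::integer THEN
          RAISE EXCEPTION 'client_id % does not match session client', NEW.client_id;
      END IF;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER orders_client_check
      BEFORE INSERT OR UPDATE ON orders
      FOR EACH ROW EXECUTE PROCEDURE check_client();

The application would run SET app.client_id = '42' right after connecting, before touching any data. Reads would still need a view or function to be covered.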
Shane Ambler <pgsql@007Marketing.com> writes:

> Thinking about it - it should only be a small source change to the sql
> parser to stop it from running a statement that didn't include clientID
> in the where clause.
> A small change that is easy to add again to new versions as they are
> released.

I'd avoid modifying source code and go with either a function or view. All queries should be against those and they'd fail if the id is missing. All filtered tables should have an index on such id, of course...

--
Jorge Godoy <jgodoy@gmail.com>
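For what it's worth, a rough sketch of the single-schema, view-based variant (untested; table and column names are invented, and it assumes one database role per client, which differs from Chris' single unprivileged role):

  CREATE TABLE public.clients (
      client_id integer PRIMARY KEY,
      db_role   name NOT NULL UNIQUE
  );

  CREATE TABLE public.orders (
      order_id  serial PRIMARY KEY,
      client_id integer NOT NULL REFERENCES public.clients,
      amount    numeric(12,2) NOT NULL
  );
  -- every filtered table gets an index on the id, as Jorge says
  CREATE INDEX orders_client_id_idx ON public.orders (client_id);

  -- applications only ever query the view, so the client filter can't be forgotten
  CREATE VIEW public.my_orders AS
      SELECT o.*
        FROM public.orders o
        JOIN public.clients c ON c.client_id = o.client_id
       WHERE c.db_role = current_user;

  REVOKE ALL ON public.orders FROM PUBLIC;
  GRANT SELECT ON public.my_orders TO PUBLIC;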
I am using a similar solution, and I tested it with a test containing 20K+ different schemas. Postgres didn't show slowness at all even after the 20K (over 2 million total tables) were created. So I have a feeling it can grow even more.

Guy.

On 9/28/06, snacktime <snacktime@gmail.com> wrote:
> This has worked well so far, but it's a real pain to manage and as we
> ramp up I'm not sure it's going to scale that well. So anyway, my
> question is this: am I being too paranoid about putting all the data
> into one set of tables in a common schema? For thousands of clients,
> what would you do?

--
Family management on rails: http://www.famundo.com - coming soon!
My development related blog: http://devblog.famundo.com
snacktime wrote:
> Right now we create a schema for each client, with each schema having
> the same tables. The connections to the database are from an
> unprivileged user, and everything goes through functions that run at
> the necessary privileges.
>
> This has worked well so far, but it's a real pain to manage and as we
> ramp up I'm not sure it's going to scale that well. So anyway, my
> question is this: am I being too paranoid about putting all the data
> into one set of tables in a common schema? For thousands of clients,
> what would you do?

Hi, where I work we have similar issues wherein we have thousands of clients who each have large amounts of the same kind of data that needs to be kept separate. What we've done is to use table inheritance. So, we have a group of base account data tables and whenever a new account is added they get a set of tables that inherit from these base tables. This works well in that whenever we need a global schema change to any of these tables we can just alter the pertinent base table and the change will cascade down to the child tables. In addition, we can customize individual accounts' tables however we may need without worrying about screwing up other accounts' data.

--
erik jones <erik@myemma.com>
software development
emma(r)
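A rough sketch of that inheritance layout (names invented, untested; note the caveat downthread that not everything propagates to the children):

  -- base table holds the common definition
  CREATE TABLE public.transactions_base (
      txn_id     serial,
      client_id  integer NOT NULL,
      amount     numeric(12,2) NOT NULL,
      created_at timestamptz NOT NULL DEFAULT now()
  );

  -- each new account gets its own child table
  CREATE SCHEMA acct_1001;
  CREATE TABLE acct_1001.transactions (
      CHECK (client_id = 1001)
  ) INHERITS (public.transactions_base);

  -- a column added to the base cascades to every child
  ALTER TABLE public.transactions_base ADD COLUMN memo text;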
On 9/29/06, Just Someone <just.some@gmail.com> wrote:
> I am using a similar solution, and I tested it with a test containing
> 20K+ different schemas. Postgres didn't show slowness at all even
> after the 20K (over 2 million total tables) were created. So I have
> a feeling it can grow even more.

That's good to know. We haven't really tested it against that many schemas, other than actually creating them to make sure there wasn't some sort of hard limit or bug no one had run into before.

Performance with schemas is actually one thing I do like. A query for any one user only hits the data in that one schema, so users with large data sets don't impact the query performance of users with smaller data sets.
On Fri, 2006-09-29 at 09:39 -0500, Erik Jones wrote:
> What we've done is to use table inheritance. So, we have a group of
> base account data tables and whenever a new account is added they get
> a set of tables that inherit from these base tables. This works well
> in that whenever we need a global schema change to any of these tables
> we can just alter the pertinent base table and the change will cascade
> down to the child tables.

Many DDL commands do not propagate to the child tables -- foreign keys, for example, are not inherited. As a consequence, using inheritance for Chris' purpose has the important caveat that referential integrity will be difficult to ensure with standard FK constraints. I believe that other types of constraints and rules are also not propagated to child tables. Erik - do you have a clever solution to this aspect of using inheritance?

On Thu, 2006-09-28 at 22:59 -0700, snacktime wrote:
> The connections to the database are from an unprivileged user, and
> everything goes through functions that run at the necessary privileges.

Someone out there must have implemented row-level security for postgresql (EnterpriseDB?). You could implement this yourself by adding a usesysid column to each table and using a combination of column defaults, views, triggers, or rules to set usesysid on insert and update, and to require that the usesysid column matches the current user on select and delete. This probably improves the consistency of your security policy (over having the policy in functions).

On the other hand, I could be blowing smoke -- it seems like this outta work, but I haven't actually done this myself. I don't actually do any real work myself.

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
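A sketch of what that could look like (untested; names are invented, it stores the session user name rather than usesysid just to keep the column default simple, and update/delete rules are omitted):

  -- the real table lives where the application role can't touch it directly
  CREATE SCHEMA private;
  CREATE TABLE private.payments (
      payment_id serial PRIMARY KEY,
      row_owner  name NOT NULL DEFAULT current_user,
      amount     numeric(12,2) NOT NULL
  );

  -- the application only ever sees rows owned by the session user
  CREATE VIEW public.payments AS
      SELECT payment_id, amount FROM private.payments
       WHERE row_owner = current_user;

  -- inserts through the view get stamped with the owner automatically
  CREATE RULE payments_ins AS ON INSERT TO public.payments DO INSTEAD
      INSERT INTO private.payments (amount) VALUES (NEW.amount);

  REVOKE ALL ON private.payments FROM PUBLIC;
  GRANT SELECT, INSERT ON public.payments TO PUBLIC;
  -- (depending on the version, the sequence behind payment_id may need a grant too)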