Thread: using schemas for data separation
I'm re-evaluating a few design choices I made a while back, and one that keeps coming to the forefront is data separation. We store sensitive information for clients. A database for each client isn't really workable, or at least I've never thought of a way to make it workable, as we have several thousand clients and the databases all have to be accessed through a limited number of web applications where performance is important and things like persistent connections are a must. I've always been paranoid about a programmer error in an application resulting in data from multiple clients getting mixed together.

Right now we create a schema for each client, with each schema having the same tables. The connections to the database are from an unprivileged user, and everything goes through functions that run at the necessary privileges. We use SET search_path TO public, user: the user's data is in their own schema and the functions are in the public schema. Every table has a client_id column.

This has worked well so far, but it's a real pain to manage and as we ramp up I'm not sure it's going to scale that well. So anyway, my question is this: am I being too paranoid about putting all the data into one set of tables in a common schema? For thousands of clients, what would you do?

Chris
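For readers skimming the thread, a minimal, untested sketch of the layout Chris describes (all object names invented):

  -- one schema per client, each containing the same tables
  CREATE SCHEMA client_1001;
  CREATE TABLE client_1001.orders (
      order_id  serial PRIMARY KEY,
      client_id integer NOT NULL,
      amount    numeric(12,2) NOT NULL
  );

  -- data access goes through functions in public, run with elevated rights
  CREATE FUNCTION public.add_order(integer, numeric) RETURNS void AS $$
      INSERT INTO orders (client_id, amount) VALUES ($1, $2);
  $$ LANGUAGE sql SECURITY DEFINER;

  -- each application session points the search path at its client's schema
  SET search_path TO public, client_1001;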
snacktime wrote:
> This has worked well so far, but it's a real pain to manage and as we
> ramp up I'm not sure it's going to scale that well. So anyway, my
> question is this: am I being too paranoid about putting all the data
> into one set of tables in a common schema? For thousands of clients,
> what would you do?

Hard to say what too paranoid really is, so let me ask you: what do you think won't scale?
On 29/9/2006 15:29, "snacktime" <snacktime@gmail.com> wrote:

> This has worked well so far, but it's a real pain to manage and as we
> ramp up I'm not sure it's going to scale that well. So anyway, my
> question is this: am I being too paranoid about putting all the data
> into one set of tables in a common schema? For thousands of clients,
> what would you do?

I would think of having a client table with their id as a permanent part of the where clause, so that you can't work without specifying which client you are working for at the time.

Not sure if a trigger would be able to ensure you can't add, update or delete unless the clientID is included (pretty sure you don't get to see the SQL, only the results). You may need to write a plugin or custom mod to get 100% certainty that a statement can't be run without the clientID included in the search. It would have to be an easier solution than continually updating thousands of schemas to keep them in sync.

Thinking about it - it should only be a small source change to the sql parser to stop it from running a statement that didn't include clientID in the where clause. A small change that is easy to add again to new versions as they are released.

--
Shane Ambler
Postgres@007Marketing.com

Get Sheeky @ http://Sheeky.Biz
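A row-level trigger can at least enforce the match on writes, even though it never sees the WHERE clause. A rough, untested sketch - the app.client_id setting is invented, and on the 8.x releases custom settings need custom_variable_classes in postgresql.conf:

  CREATE FUNCTION check_client() RETURNS trigger AS $$
  BEGIN
      -- reject any row that isn't tagged with the session's client
      IF NEW.client_id <> current_setting('app.client_id')::integer THEN
          RAISE EXCEPTION 'client_id % does not match session client', NEW.client_id;
      END IF;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER orders_client_check
      BEFORE INSERT OR UPDATE ON orders
      FOR EACH ROW EXECUTE PROCEDURE check_client();

The application would run SET app.client_id = '42' right after connecting, before touching any data. Reads would still need a view or function to be covered.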
Shane Ambler <pgsql@007Marketing.com> writes:

> Thinking about it - it should only be a small source change to the sql
> parser to stop it from running a statement that didn't include clientID
> in the where clause.
> A small change that is easy to add again to new versions as they are
> released.

I'd avoid modifying source code and go with either a function or view. All queries should be against those and they'd fail if the id is missing. All filtered tables should have an index on such id, of course...

--
Jorge Godoy <jgodoy@gmail.com>
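For what it's worth, a rough sketch of the single-schema, view-based variant (untested; table and column names are invented, and it assumes one database role per client, which differs from Chris' single unprivileged role):

  CREATE TABLE public.clients (
      client_id integer PRIMARY KEY,
      db_role   name NOT NULL UNIQUE
  );

  CREATE TABLE public.orders (
      order_id  serial PRIMARY KEY,
      client_id integer NOT NULL REFERENCES public.clients,
      amount    numeric(12,2) NOT NULL
  );
  -- every filtered table gets an index on the id, as Jorge says
  CREATE INDEX orders_client_id_idx ON public.orders (client_id);

  -- applications only ever query the view, so the client filter can't be forgotten
  CREATE VIEW public.my_orders AS
      SELECT o.*
        FROM public.orders o
        JOIN public.clients c ON c.client_id = o.client_id
       WHERE c.db_role = current_user;

  REVOKE ALL ON public.orders FROM PUBLIC;
  GRANT SELECT ON public.my_orders TO PUBLIC;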
I am using a similar solution, and I tested it with a test containing 20K+ different schemas. Postgres didn't show slowness at all even after the 20K (over 2 million total tables) were created. So I have a feeling it can grow even more.

Guy.

On 9/28/06, snacktime <snacktime@gmail.com> wrote:
> This has worked well so far, but it's a real pain to manage and as we
> ramp up I'm not sure it's going to scale that well. So anyway, my
> question is this: am I being too paranoid about putting all the data
> into one set of tables in a common schema? For thousands of clients,
> what would you do?

--
Family management on rails: http://www.famundo.com - coming soon!
My development related blog: http://devblog.famundo.com
snacktime wrote:
> Right now we create a schema for each client, with each schema having
> the same tables. The connections to the database are from an
> unprivileged user, and everything goes through functions that run at
> the necessary privileges.
>
> This has worked well so far, but it's a real pain to manage and as we
> ramp up I'm not sure it's going to scale that well. So anyway, my
> question is this: am I being too paranoid about putting all the data
> into one set of tables in a common schema? For thousands of clients,
> what would you do?

Hi, where I work we have similar issues wherein we have thousands of clients who each have large amounts of the same kind of data that needs to be kept separate. What we've done is to use table inheritance. So, we have a group of base account data tables and whenever a new account is added they get a set of tables that inherit from these base tables. This works well in that whenever we need a global schema change to any of these tables we can just alter the pertinent base table and the change will cascade down to the child tables. In addition, we can customize individual accounts' tables however we may need without worrying about screwing up other accounts' data.

--
erik jones <erik@myemma.com>
software development
emma(r)
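A rough sketch of that inheritance layout (names invented, untested; note the caveat downthread that not everything propagates to the children):

  -- base table holds the common definition
  CREATE TABLE public.transactions_base (
      txn_id     serial,
      client_id  integer NOT NULL,
      amount     numeric(12,2) NOT NULL,
      created_at timestamptz NOT NULL DEFAULT now()
  );

  -- each new account gets its own child table
  CREATE SCHEMA acct_1001;
  CREATE TABLE acct_1001.transactions (
      CHECK (client_id = 1001)
  ) INHERITS (public.transactions_base);

  -- a column added to the base cascades to every child
  ALTER TABLE public.transactions_base ADD COLUMN memo text;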
On 9/29/06, Just Someone <just.some@gmail.com> wrote:
> I am using a similar solution, and I tested it with a test containing
> 20K+ different schemas. Postgres didn't show slowness at all even
> after the 20K (over 2 million total tables) were created. So I have
> a feeling it can grow even more.

That's good to know. We haven't really tested it against that many schemas, other than actually creating them to make sure there wasn't some sort of hard limit or bug no one had run into before.

Performance with schemas is actually one thing I do like. A query for any one user only hits the data in that one schema, so users with large data sets don't impact the query performance of users with smaller data sets.
On Fri, 2006-09-29 at 09:39 -0500, Erik Jones wrote:
> What we've done is to use table inheritance. So, we have a group of
> base account data tables and whenever a new account is added they get
> a set of tables that inherit from these base tables. This works well
> in that whenever we need a global schema change to any of these tables
> we can just alter the pertinent base table and the change will cascade
> down to the child tables.

Many DDL commands do not propagate to the child tables -- foreign keys, for example, are not inherited. As a consequence, using inheritance for Chris' purpose has the important caveat that referential integrity will be difficult to ensure with standard FK constraints. I believe that other types of constraints and rules are also not propagated to child tables. Erik - do you have a clever solution to this aspect of using inheritance?

On Thu, 2006-09-28 at 22:59 -0700, snacktime wrote:
> The connections to the database are from an unprivileged user, and
> everything goes through functions that run at the necessary privileges.

Someone out there must have implemented row-level security for postgresql (EnterpriseDB?). You could implement this yourself by adding a usesysid column to each table and using a combination of column defaults, views, triggers, or rules to set usesysid on insert and update, and to require that the usesysid column matches the current user on select and delete. This probably improves the consistency of your security policy (over having the policy in functions).

On the other hand, I could be blowing smoke -- it seems like this outta work, but I haven't actually done this myself. I don't actually do any real work myself.

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
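A sketch of what that could look like (untested; names are invented, it stores the session user name rather than usesysid just to keep the column default simple, and update/delete rules are omitted):

  -- the real table lives where the application role can't touch it directly
  CREATE SCHEMA private;
  CREATE TABLE private.payments (
      payment_id serial PRIMARY KEY,
      row_owner  name NOT NULL DEFAULT current_user,
      amount     numeric(12,2) NOT NULL
  );

  -- the application only ever sees rows owned by the session user
  CREATE VIEW public.payments AS
      SELECT payment_id, amount FROM private.payments
       WHERE row_owner = current_user;

  -- inserts through the view get stamped with the owner automatically
  CREATE RULE payments_ins AS ON INSERT TO public.payments DO INSTEAD
      INSERT INTO private.payments (amount) VALUES (NEW.amount);

  REVOKE ALL ON private.payments FROM PUBLIC;
  GRANT SELECT, INSERT ON public.payments TO PUBLIC;
  -- (depending on the version, the sequence behind payment_id may need a grant too)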