Thread: Schema boggle...
Hi All, I work with Marc Fournier, for those who don't know, and have been working extensively with the schemas feature for Hub.Org's new Account Management package. Each client's data is stored in a seperate schema set asside just for them (though they will never have direct access to it, it helps us "balance" out the data. Each schema has the same set of tables, with which different data is stored, depending on the client. I've run into two issues and I've been reading over the docs in relation to schemas in the hopes that I could find a solution to my problems. Here are the details of the issues: 1. The first issue I'm running with, is that if I have several schemas with the following name: public test_001 test_002 test_003 the test_* schemas have the same tables listed in the their name spaces, such that: Schema test_001 contains: test_contact; test_domains; test_accounts; ...Schema test_002 contains: test_contact; test_domains; test_accounts; ... etc. As you can see this is a nice, clean way to break down some datasets. But, if I do: set search_path to public, test_001, test_002; I only get access to the tables in test_001 and public, the tables in test_002 are not listed, and thus I do not see them on the screen while doing a "\d". 2. Issue 2, is more a SQL issue then anything, but I'd like to be able to do something to the affect of: SELECT * FROM test_*.test_domains ORDER BY domain_name; or better yet: SELECT * from public.domain_summary ds LEFT JOIN test_*.test_domains td ON (td.domain_id = ds.id); This would mean I can run one query to get all of the information I need, rather then having to run a query for each client that I have to list domains for. If I've got 10,000 clients, that's going to take a ton of time to get the results I'm looking for... Does anyone know if any of these issues have been addressed in 7.4? I see some references to schema based changes, but no details on what those changes were. If not, are there any plans to do something like that? -- Chris Bowlby <excalibur@hub.org> Hub.Org Networking Services
Chris Bowlby wrote: > Hi All, > > I work with Marc Fournier, for those who don't know, and have been > working extensively with the schemas feature for Hub.Org's new Account > Management package. Each client's data is stored in a seperate schema > set asside just for them (though they will never have direct access to > it, it helps us "balance" out the data. Each schema has the same set of > tables, with which different data is stored, depending on the client. > > I've run into two issues and I've been reading over the docs in > relation to schemas in the hopes that I could find a solution to my > problems. Here are the details of the issues: > > 1. The first issue I'm running with, is that if I have several schemas > with the following name: > > public > test_001 > test_002 > test_003 > > the test_* schemas have the same tables listed in the their name > spaces, such that: > > Schema test_001 contains: > > test_contact; > test_domains; > test_accounts; > ... > > Schema test_002 contains: > > test_contact; > test_domains; > test_accounts; > ... > > etc. > > As you can see this is a nice, clean way to break down some datasets. > But, if I do: > > set search_path to public, test_001, test_002; > > I only get access to the tables in test_001 and public, the tables in > test_002 are not listed, and thus I do not see them on the screen while > doing a "\d". > > 2. Issue 2, is more a SQL issue then anything, but I'd like to be able > to do something to the affect of: > > SELECT * FROM test_*.test_domains ORDER BY domain_name; > > or better yet: > > SELECT * from public.domain_summary ds LEFT JOIN test_*.test_domains td > ON (td.domain_id = ds.id); > > This would mean I can run one query to get all of the information I > need, rather then having to run a query for each client that I have to > list domains for. If I've got 10,000 clients, that's going to take a ton > of time to get the results I'm looking for... > > Does anyone know if any of these issues have been addressed in 7.4? I > see some references to schema based changes, but no details on what > those changes were. If not, are there any plans to do something like > that? You can't use a wildcard like that, but you can create a view like CREATE VIEW public.all_test_domains AS SELECT * FROM test_001.test_domains UNION ALL SELECT * FROMtest_002.test_domains; You have to change that view as customers (resp. schemas) come and go. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On Wed, 5 Nov 2003, Jan Wieck wrote: > Chris Bowlby wrote: > > > Hi All, > > > > I work with Marc Fournier, for those who don't know, and have been > > working extensively with the schemas feature for Hub.Org's new Account > > Management package. Each client's data is stored in a seperate schema > > set asside just for them (though they will never have direct access to > > it, it helps us "balance" out the data. Each schema has the same set of > > tables, with which different data is stored, depending on the client. > > > > I've run into two issues and I've been reading over the docs in > > relation to schemas in the hopes that I could find a solution to my > > problems. Here are the details of the issues: > > > > 1. The first issue I'm running with, is that if I have several schemas > > with the following name: > > > > public > > test_001 > > test_002 > > test_003 > > > > the test_* schemas have the same tables listed in the their name > > spaces, such that: > > > > Schema test_001 contains: > > > > test_contact; > > test_domains; > > test_accounts; > > ... > > > > Schema test_002 contains: > > > > test_contact; > > test_domains; > > test_accounts; > > ... > > > > etc. > > > > As you can see this is a nice, clean way to break down some datasets. > > But, if I do: > > > > set search_path to public, test_001, test_002; > > > > I only get access to the tables in test_001 and public, the tables in > > test_002 are not listed, and thus I do not see them on the screen while > > doing a "\d". > > > > 2. Issue 2, is more a SQL issue then anything, but I'd like to be able > > to do something to the affect of: > > > > SELECT * FROM test_*.test_domains ORDER BY domain_name; > > > > or better yet: > > > > SELECT * from public.domain_summary ds LEFT JOIN test_*.test_domains td > > ON (td.domain_id = ds.id); > > > > This would mean I can run one query to get all of the information I > > need, rather then having to run a query for each client that I have to > > list domains for. If I've got 10,000 clients, that's going to take a ton > > of time to get the results I'm looking for... > > > > Does anyone know if any of these issues have been addressed in 7.4? I > > see some references to schema based changes, but no details on what > > those changes were. If not, are there any plans to do something like > > that? > > You can't use a wildcard like that, but you can create a view like > > CREATE VIEW public.all_test_domains AS > SELECT * FROM test_001.test_domains > UNION ALL > SELECT * FROM test_002.test_domains; > > You have to change that view as customers (resp. schemas) come and go. Oh, just thought of it ... I know there used to be limitation on this, but I swore that they were removed ... how about a function? could you do something like: for i in `SELECT nspname FROM nspname WHERE nspname LIKE 'test_*'` ... I know functins returning multi-rows was a limitation in the past, but didn't that *just* get corrected, or am I mis-remembering?
Chris Bowlby <excalibur@hub.org> writes: > As you can see this is a nice, clean way to break down some datasets. > But, if I do: > set search_path to public, test_001, test_002; > I only get access to the tables in test_001 and public, the tables in > test_002 are not listed, and thus I do not see them on the screen while > doing a "\d". Well, sure. They are masked by the identically named tables in test_001. How else would you expect it to work? > 2. Issue 2, is more a SQL issue then anything, but I'd like to be able > to do something to the affect of: > SELECT * FROM test_*.test_domains ORDER BY domain_name; I can't see trying to support this at the SQL level. You might consider making a plpgsql function that constructs a UNION ALL query listing all the proper schemas and then EXECUTEs it on-the-fly. Another possible solution is to make all the individual tables be inheritance children of a master table. Then selecting from the master would work (and would be equivalent to the UNION performance-wise, I think). regards, tom lane
Chris, > I work with Marc Fournier, for those who don't know, and have been > working extensively with the schemas feature for Hub.Org's new Account > Management package. Each client's data is stored in a seperate schema > set asside just for them (though they will never have direct access to > it, it helps us "balance" out the data. Each schema has the same set of > tables, with which different data is stored, depending on the client. While this is an interesting use of Schema, it is not, IMHO, a good way to partition client data. Far better to have monolithic tables with a "client_id" column and then construct *views* which are available in each schema only to that client. Otherwise, as you've discovered, top-level management becomes a royal pain. I'd be happy to discuss this further on PGSQL-SQL, which is really the appropriate mailing list. -- -Josh BerkusAglio Database SolutionsSan Francisco
On Wed, 5 Nov 2003, Tom Lane wrote: > Chris Bowlby <excalibur@hub.org> writes: > > As you can see this is a nice, clean way to break down some datasets. > > But, if I do: > > > set search_path to public, test_001, test_002; > > > I only get access to the tables in test_001 and public, the tables in > > test_002 are not listed, and thus I do not see them on the screen while > > doing a "\d". > > Well, sure. They are masked by the identically named tables in > test_001. How else would you expect it to work? List of relationsSchema | Name | Type | Owner ----------+-----------------------+----------+-----------public | categories | table | 186_pgsqlpublic | categories_rec_id_seq | sequence | 186_pgsqltest_001 | table1 | table | 186_pgsqltest_002 | table1 | table | 186_pgsql the uniqueness, I would have thought, woudl have been schema.name, not just name ...
"Marc G. Fournier" <scrappy@postgresql.org> writes: >> How else would you expect it to work? > List of relations > Schema | Name | Type | Owner > ----------+-----------------------+----------+----------- > public | categories | table | 186_pgsql > public | categories_rec_id_seq | sequence | 186_pgsql > test_001 | table1 | table | 186_pgsql > test_002 | table1 | table | 186_pgsql You can get something like that if you say "\d *.*". If you say "\d *" or equivalently just "\d", then what you see is only the tables that you could reference with unqualified names. Which does not include the tables in test_002, because they're hidden by the ones in test_001. regards, tom lane
Actually, the use of schema's was my idea, to speed up some dreadfully slow queries dealing with traffic stats from a table that was growing painfully monolithic ... the idea is/was that it would be easier to backup/remove all data pertaining to a specific client if they decided to close their account ... On Wed, 5 Nov 2003, Josh Berkus wrote: > Chris, > > > I work with Marc Fournier, for those who don't know, and have been > > working extensively with the schemas feature for Hub.Org's new Account > > Management package. Each client's data is stored in a seperate schema > > set asside just for them (though they will never have direct access to > > it, it helps us "balance" out the data. Each schema has the same set of > > tables, with which different data is stored, depending on the client. > > While this is an interesting use of Schema, it is not, IMHO, a good way to > partition client data. Far better to have monolithic tables with a > "client_id" column and then construct *views* which are available in each > schema only to that client. Otherwise, as you've discovered, top-level > management becomes a royal pain. > > I'd be happy to discuss this further on PGSQL-SQL, which is really the > appropriate mailing list. > > -- > -Josh Berkus > Aglio Database Solutions > San Francisco > > > ---------------------------(end of broadcast)--------------------------- > TIP 8: explain analyze is your friend >
"Marc G. Fournier" <scrappy@postgresql.org> writes: > Actually, the use of schema's was my idea, to speed up some dreadfully > slow queries dealing with traffic stats from a table that was growing > painfully monolithic ... the idea is/was that it would be easier to > backup/remove all data pertaining to a specific client if they decided to > close their account ... I would add an <aol> me too </aol> to the comment that this is a horrible idea and will be an enormous pain. You're denormalizing your data in a way that will cause you great pain. What you're trying to implement is called "partitioned tables". And you're right, bulk loading and deleting is one of the big advantages of partitioned tables. But Postgres doesn't support partitioned tables, so trying to roll your own is hard. Using schemas seems like a bad impedance match here too, as you've found out with the search_path. Inherited tables is a closer match, still a lot of things won't be automatic, and primary keys may be a problem. But what most people use as the closest approximation to partitioned tables in Postgres is partial indexes. It lets you keep your indexes to a reasonable size but still accelerate the poor-selectivity client_id column in parallel. But you still would be storing all the records in a single table and would have to do some big vacuuming whenever you delete a whole client. vacuum full may be your friend. I don't understand why the queries should be any slower dealing with the normalized data in a single table versus the dernormalized tables in multiple tables. The one big exception is any query doing "where client_id = ?" where the latter allows the use of a sequential scan instead of an index scan. -- greg