Thread: Schema boggle...

Schema boggle...

From
Chris Bowlby
Date:
Hi All, 
I work with Marc Fournier, for those who don't know, and have been
working extensively with the schemas feature for Hub.Org's new Account
Management package. Each client's data is stored in a seperate schema
set asside just for them (though they will never have direct access to
it, it helps us "balance" out the data. Each schema has the same set of
tables, with which different data is stored, depending on the client.
I've run into two issues and I've been reading over the docs in
relation to schemas in the hopes that I could find a solution to my
problems. Here are the details of the issues:
1. The first issue I'm running with, is that if I have several schemas
with the following name:
 public test_001 test_002 test_003
the test_* schemas have the same tables listed in the their name
spaces, such that:
Schema test_001 contains:
 test_contact; test_domains; test_accounts; ...Schema test_002 contains:
 test_contact; test_domains; test_accounts; ...
etc.
As you can see this is a nice, clean way to break down some datasets.
But, if I do:
set search_path to public, test_001, test_002;
I only get access to the tables in test_001 and public, the tables in
test_002 are not listed, and thus I do not see them on the screen while
doing a "\d".

2. Issue 2, is more a SQL issue then anything, but I'd like to be able
to do something to the affect of:
SELECT * FROM test_*.test_domains ORDER BY domain_name;
or better yet:
SELECT * from public.domain_summary ds LEFT JOIN test_*.test_domains td
ON (td.domain_id = ds.id);
This would mean I can run one query to get all of the information I
need, rather then having to run a query for each client that I have to
list domains for. If I've got 10,000 clients, that's going to take a ton
of time to get the results I'm looking for...
Does anyone know if any of these issues have been addressed in 7.4? I
see some references to schema based changes, but no details on what
those changes were. If not, are there any plans to do something like
that?
-- 
Chris Bowlby <excalibur@hub.org>
Hub.Org Networking Services



Re: Schema boggle...

From
Jan Wieck
Date:
Chris Bowlby wrote:

> Hi All, 
> 
>  I work with Marc Fournier, for those who don't know, and have been
> working extensively with the schemas feature for Hub.Org's new Account
> Management package. Each client's data is stored in a seperate schema
> set asside just for them (though they will never have direct access to
> it, it helps us "balance" out the data. Each schema has the same set of
> tables, with which different data is stored, depending on the client.
> 
>  I've run into two issues and I've been reading over the docs in
> relation to schemas in the hopes that I could find a solution to my
> problems. Here are the details of the issues:
> 
>  1. The first issue I'm running with, is that if I have several schemas
> with the following name:
> 
>   public
>   test_001
>   test_002
>   test_003
> 
>  the test_* schemas have the same tables listed in the their name
> spaces, such that:
> 
>  Schema test_001 contains:
> 
>   test_contact;
>   test_domains;
>   test_accounts;
>   ...
>  
>  Schema test_002 contains:
> 
>   test_contact;
>   test_domains;
>   test_accounts;
>   ...
> 
>  etc.
> 
>  As you can see this is a nice, clean way to break down some datasets.
> But, if I do:
> 
>  set search_path to public, test_001, test_002;
> 
>  I only get access to the tables in test_001 and public, the tables in
> test_002 are not listed, and thus I do not see them on the screen while
> doing a "\d".
> 
> 2. Issue 2, is more a SQL issue then anything, but I'd like to be able
> to do something to the affect of:
> 
>  SELECT * FROM test_*.test_domains ORDER BY domain_name;
> 
>  or better yet:
> 
>  SELECT * from public.domain_summary ds LEFT JOIN test_*.test_domains td
> ON (td.domain_id = ds.id);
> 
>  This would mean I can run one query to get all of the information I
> need, rather then having to run a query for each client that I have to
> list domains for. If I've got 10,000 clients, that's going to take a ton
> of time to get the results I'm looking for...
> 
>  Does anyone know if any of these issues have been addressed in 7.4? I
> see some references to schema based changes, but no details on what
> those changes were. If not, are there any plans to do something like
> that?

You can't use a wildcard like that, but you can create a view like
    CREATE VIEW public.all_test_domains AS        SELECT * FROM test_001.test_domains        UNION ALL        SELECT *
FROMtest_002.test_domains;
 

You have to change that view as customers (resp. schemas) come and go.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Schema boggle...

From
"Marc G. Fournier"
Date:

On Wed, 5 Nov 2003, Jan Wieck wrote:

> Chris Bowlby wrote:
>
> > Hi All,
> >
> >  I work with Marc Fournier, for those who don't know, and have been
> > working extensively with the schemas feature for Hub.Org's new Account
> > Management package. Each client's data is stored in a seperate schema
> > set asside just for them (though they will never have direct access to
> > it, it helps us "balance" out the data. Each schema has the same set of
> > tables, with which different data is stored, depending on the client.
> >
> >  I've run into two issues and I've been reading over the docs in
> > relation to schemas in the hopes that I could find a solution to my
> > problems. Here are the details of the issues:
> >
> >  1. The first issue I'm running with, is that if I have several schemas
> > with the following name:
> >
> >   public
> >   test_001
> >   test_002
> >   test_003
> >
> >  the test_* schemas have the same tables listed in the their name
> > spaces, such that:
> >
> >  Schema test_001 contains:
> >
> >   test_contact;
> >   test_domains;
> >   test_accounts;
> >   ...
> >
> >  Schema test_002 contains:
> >
> >   test_contact;
> >   test_domains;
> >   test_accounts;
> >   ...
> >
> >  etc.
> >
> >  As you can see this is a nice, clean way to break down some datasets.
> > But, if I do:
> >
> >  set search_path to public, test_001, test_002;
> >
> >  I only get access to the tables in test_001 and public, the tables in
> > test_002 are not listed, and thus I do not see them on the screen while
> > doing a "\d".
> >
> > 2. Issue 2, is more a SQL issue then anything, but I'd like to be able
> > to do something to the affect of:
> >
> >  SELECT * FROM test_*.test_domains ORDER BY domain_name;
> >
> >  or better yet:
> >
> >  SELECT * from public.domain_summary ds LEFT JOIN test_*.test_domains td
> > ON (td.domain_id = ds.id);
> >
> >  This would mean I can run one query to get all of the information I
> > need, rather then having to run a query for each client that I have to
> > list domains for. If I've got 10,000 clients, that's going to take a ton
> > of time to get the results I'm looking for...
> >
> >  Does anyone know if any of these issues have been addressed in 7.4? I
> > see some references to schema based changes, but no details on what
> > those changes were. If not, are there any plans to do something like
> > that?
>
> You can't use a wildcard like that, but you can create a view like
>
>      CREATE VIEW public.all_test_domains AS
>          SELECT * FROM test_001.test_domains
>          UNION ALL
>          SELECT * FROM test_002.test_domains;
>
> You have to change that view as customers (resp. schemas) come and go.

Oh, just thought of it ... I know there used to be limitation on this, but
I swore that they were removed ... how about a function?  could you do
something like:

for i in `SELECT nspname FROM nspname WHERE nspname LIKE 'test_*'`

... I know functins returning multi-rows was a limitation in the past, but
didn't that *just* get corrected, or am I mis-remembering?



Re: Schema boggle...

From
Tom Lane
Date:
Chris Bowlby <excalibur@hub.org> writes:
>  As you can see this is a nice, clean way to break down some datasets.
> But, if I do:

>  set search_path to public, test_001, test_002;

>  I only get access to the tables in test_001 and public, the tables in
> test_002 are not listed, and thus I do not see them on the screen while
> doing a "\d".

Well, sure.  They are masked by the identically named tables in
test_001.  How else would you expect it to work?

> 2. Issue 2, is more a SQL issue then anything, but I'd like to be able
> to do something to the affect of:

>  SELECT * FROM test_*.test_domains ORDER BY domain_name;

I can't see trying to support this at the SQL level.  You might consider
making a plpgsql function that constructs a UNION ALL query listing all
the proper schemas and then EXECUTEs it on-the-fly.

Another possible solution is to make all the individual tables be
inheritance children of a master table.  Then selecting from the master
would work (and would be equivalent to the UNION performance-wise, I
think).
        regards, tom lane


Re: Schema boggle...

From
Josh Berkus
Date:
Chris,

>  I work with Marc Fournier, for those who don't know, and have been
> working extensively with the schemas feature for Hub.Org's new Account
> Management package. Each client's data is stored in a seperate schema
> set asside just for them (though they will never have direct access to
> it, it helps us "balance" out the data. Each schema has the same set of
> tables, with which different data is stored, depending on the client.

While this is an interesting use of Schema, it is not, IMHO, a good way to
partition client data.    Far better to have monolithic tables with a
"client_id" column and then construct *views* which are available in each
schema only to that client.   Otherwise, as you've discovered, top-level
management becomes a royal pain.

I'd be happy to discuss this further on PGSQL-SQL, which is really the
appropriate mailing list.

--
-Josh BerkusAglio Database SolutionsSan Francisco



Re: Schema boggle...

From
"Marc G. Fournier"
Date:

On Wed, 5 Nov 2003, Tom Lane wrote:

> Chris Bowlby <excalibur@hub.org> writes:
> >  As you can see this is a nice, clean way to break down some datasets.
> > But, if I do:
>
> >  set search_path to public, test_001, test_002;
>
> >  I only get access to the tables in test_001 and public, the tables in
> > test_002 are not listed, and thus I do not see them on the screen while
> > doing a "\d".
>
> Well, sure.  They are masked by the identically named tables in
> test_001.  How else would you expect it to work?
                    List of relationsSchema   |         Name          |   Type   |   Owner
----------+-----------------------+----------+-----------public   | categories            | table    | 186_pgsqlpublic
| categories_rec_id_seq | sequence | 186_pgsqltest_001 | table1                | table    | 186_pgsqltest_002 | table1
             | table    | 186_pgsql
 

the uniqueness, I would have thought, woudl have been schema.name, not
just name ...



Re: Schema boggle...

From
Tom Lane
Date:
"Marc G. Fournier" <scrappy@postgresql.org> writes:
>>  How else would you expect it to work?

>                      List of relations
>  Schema   |         Name          |   Type   |   Owner
> ----------+-----------------------+----------+-----------
>  public   | categories            | table    | 186_pgsql
>  public   | categories_rec_id_seq | sequence | 186_pgsql
>  test_001 | table1                | table    | 186_pgsql
>  test_002 | table1                | table    | 186_pgsql

You can get something like that if you say "\d *.*".  If you say "\d *"
or equivalently just "\d", then what you see is only the tables that you
could reference with unqualified names.  Which does not include the
tables in test_002, because they're hidden by the ones in test_001.
        regards, tom lane


Re: Schema boggle...

From
"Marc G. Fournier"
Date:
Actually, the use of schema's was my idea, to speed up some dreadfully
slow queries dealing with traffic stats from a table that was growing
painfully monolithic ... the idea is/was that it would be easier to
backup/remove all data pertaining to a specific client if they decided to
close their account ...

On Wed, 5 Nov 2003, Josh Berkus wrote:

> Chris,
>
> >  I work with Marc Fournier, for those who don't know, and have been
> > working extensively with the schemas feature for Hub.Org's new Account
> > Management package. Each client's data is stored in a seperate schema
> > set asside just for them (though they will never have direct access to
> > it, it helps us "balance" out the data. Each schema has the same set of
> > tables, with which different data is stored, depending on the client.
>
> While this is an interesting use of Schema, it is not, IMHO, a good way to
> partition client data.    Far better to have monolithic tables with a
> "client_id" column and then construct *views* which are available in each
> schema only to that client.   Otherwise, as you've discovered, top-level
> management becomes a royal pain.
>
> I'd be happy to discuss this further on PGSQL-SQL, which is really the
> appropriate mailing list.
>
> --
> -Josh Berkus
>  Aglio Database Solutions
>  San Francisco
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>


Re: Schema boggle...

From
Greg Stark
Date:
"Marc G. Fournier" <scrappy@postgresql.org> writes:

> Actually, the use of schema's was my idea, to speed up some dreadfully
> slow queries dealing with traffic stats from a table that was growing
> painfully monolithic ... the idea is/was that it would be easier to
> backup/remove all data pertaining to a specific client if they decided to
> close their account ...

I would add an <aol> me too </aol> to the comment that this is a horrible idea
and will be an enormous pain. You're denormalizing your data in a way that
will cause you great pain. 

What you're trying to implement is called "partitioned tables". And you're
right, bulk loading and deleting is one of the big advantages of partitioned
tables. But Postgres doesn't support partitioned tables, so trying to roll
your own is hard.

Using schemas seems like a bad impedance match here too, as you've found out
with the search_path. Inherited tables is a closer match, still a lot of
things won't be automatic, and primary keys may be a problem.

But what most people use as the closest approximation to partitioned tables in
Postgres is partial indexes. It lets you keep your indexes to a reasonable
size but still accelerate the poor-selectivity client_id column in parallel.

But you still would be storing all the records in a single table and would
have to do some big vacuuming whenever you delete a whole client. vacuum full
may be your friend.

I don't understand why the queries should be any slower dealing with the
normalized data in a single table versus the dernormalized tables in multiple
tables. The one big exception is any query doing "where client_id = ?" where
the latter allows the use of a sequential scan instead of an index scan.

-- 
greg