Thread: Re-partitioning huge schema

Re-partitioning huge schema

From: Erik Jones
Date: December 3, 2007
Hi, I've inherited <smirk> a database schema wherein the original
developers took the inheritance mechanism to an extreme where new
client accounts get 13 different tables of their own created for
them.  We're at the many tens of thousands of tables mark (well over
100K) and I'm going to be re-partitioning most of those 13 table
types.  Basically, what I'm looking for here is advice on whether or
not my proposed method of repartitioning the data is valid or if I
should be thinking differently.

We're an email marketing campaign creation, delivery, and response
tracking service, so I'll use the opens tracking tables for this.

Here's the opens tables' layout:

          Column          |            Type             |       Modifiers
--------------------------+-----------------------------+------------------------
 message_user_agent       | character varying(255)      | not null
 message_open_ts          | timestamp without time zone | not null default now()
 message_is_text_only     | smallint                    | not null default 0
 message_id               | bigint                      | not null
 mailing_id               | bigint                      |
 member_id                | bigint                      |
Indexes:
    "u_mesopens_pkey" PRIMARY KEY, btree (emma_message_id)
    "u_mesopens_emma_message_open_ts" btree (emma_message_open_ts)
    "u_mesopens_emma_mailing_id_idx" btree (emma_mailing_id)
    "u_mesopens_emma_member_id_idx" btree (emma_member_id)

All of the other table types follow this general style, i.e. an
id-based primary key and indexes along with a timestamp somewhere in
there recording the original entry time.  The majority of the queries
we run against these are solely id based, with some being time (or id +
time) based, probably 10-20% at most for the latter.
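
For example, queries of those two flavors look roughly like this (the
table name and literals here are just made-up stand-ins for the real
per-account table names):

-- id based: opens for one mailing
select count(*)
from account_1234_messages_history_opens
where mailing_id = 42;

-- id + time based: members who opened in the last year
select distinct member_id
from account_1234_messages_history_opens
where message_open_ts > now() - interval '1 year';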

In order to get a ballpark idea of the current table stats I ran the
following query against pg_class for all of these types of tables
(this one is just for the opens):

select substring(relname from '[0-9]+_(.*)$') as ttype,
        count(relname) as total_tables,
        sum(reltuples) as total_tuples,
        max(reltuples) as max_tuples,
        to_char(avg(reltuples), '999999.99') as avg_num_tuples,
        to_char(stddev_pop(reltuples), '9999999.99') as std_dev_from_avg
from pg_class
where relname ~ 'opens$'
     and substring(relname from '[0-9]+_(.*)$') is not null
group by ttype;

With the following results:

-[ RECORD 1 ]----+-----------------------
ttype            | messages_history_opens
total_tables     | 14027
total_tuples     |   139284528
max_tuples       |   2760599
avg_num_tuples   |    9929.84
std_dev_from_avg |    59945.51

Now, for this discussion, let me also point out that we've had a
pretty steady growth rate of about 150% per year since the company
opened 5 years ago and, for the sake of argument, we'll use that here,
although we really don't have any guarantees past personal faith that
we can maintain that :)

So, I'm looking at either a bin partitioning scheme or range
partitioning based on the date timestamps, and both seem to have their
pros and cons.  For the bin example, if I set this table type at 300
partitions it will take approximately 5 years before any of the
partitions reaches the size of our current largest opens table.  This
is obviously very attractive from a management perspective and has the
added advantage that I could probably expect the spread of data (volume
per table) to be pretty even over time.  However, it's those times
somebody wants to ask, "Show me all my members who have opened in the
last year" that it becomes a problem, as that data would most likely be
spread over a radically varying number of partitions.  I could probably
solve that by basing the partitions on account_id modulo the partition
count.
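
To make that concrete, a single bin under the usual inheritance-plus-
CHECK-constraint style would look something like this (sketch only: it
assumes a consolidated parent with an account_id column, which isn't in
the per-account layout shown above, and all the names are made up):

create table messages_history_opens (
    message_user_agent   varchar(255) not null,
    message_open_ts      timestamp not null default now(),
    message_is_text_only smallint not null default 0,
    message_id           bigint not null,
    mailing_id           bigint,
    member_id            bigint,
    account_id           bigint not null   -- assumed partitioning key
);

create table messages_history_opens_bin_042 (
    check (account_id % 300 = 42)
) inherits (messages_history_opens);

-- note: constraint exclusion generally can't derive the modulo predicate
-- from "account_id = 1234" on its own, so queries would need to repeat
-- it, e.g. "where account_id = 1234 and account_id % 300 = 1234 % 300".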

The other option, of course, is to go with the "standard" (my quotes
based on what I've seen on this and the performance list) range based
partitioning.  However, it really wouldn't make a lot of sense to make
those partitions based on the timestamp fields, as the data for one
mailing could then span many of the partitions and the same question I
noted in the last paragraph would again result in hitting potentially
many (although not as many as before) partitions.  So, what I'm left
with here is to partition on id ranges but, in order to maintain a
somewhat stable maximum partition size, I'd have to play a statistical
guessing game wrt the size the previous month's (or week's) partitions
grew to v. their defined id ranges.  That's doable, but not very
attractive.  To be honest, I'm not even convinced that that's something
I'd really need to worry about on a month-to-month basis.
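
In other words, the children would carry id-range checks whose widths
get re-picked each period.  A sketch against the same hypothetical
parent as above (the boundaries below are made up, sized off the
previous period's growth):

create table messages_history_opens_200711 (
    check (message_id >= 400000000 and message_id < 520000000)
) inherits (messages_history_opens);

create table messages_history_opens_200712 (
    -- a wider id range, chosen to hold roughly the same number of rows
    check (message_id >= 520000000 and message_id < 700000000)
) inherits (messages_history_opens);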

Seen from one perspective, partitioning on id ranges v. using the bin
partitioning method are kind of similar.  The biggest difference is
that with the range setup the majority of the working set of data will
be in the last couple of months' partitions, whereas with the bin
method it will be spread pretty evenly across all of them, i.e. lots of
pages from a couple of tables constantly being worked with v. a couple
of pages from a lot of tables constantly being worked with.

Any thoughts or advice?

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



Re: Re-partitioning huge schema

From: Robert Treat
Date: December 5, 2007
On Monday 03 December 2007 17:32, Erik Jones wrote:
> Hi, I've inherited <smirk> a database schema wherein the original
> developers took the inheritance mechanism to an extreme where new
> client accounts get 13 different tables of their own created for
> them.  We're at the many tens of thousands of tables mark (well over
> 100K) and I'm going to be re-partitioning most of those 13 table
> types.  Basically, what I'm looking for here is advice on whether or
> not my proposed method of repartitioning the data is valid or if I
> should be thinking differently.
>
> We're an email marketing campaign creation, delivery, and response
> tracking delivery service and so I'll use the opens tracking tables
> for this.
>
> Here's the opens tables' layout:
>
>           Column          |            Type             |       Modifiers
> --------------------------+-----------------------------+------------------------
>  message_user_agent       | character varying(255)      | not null
>  message_open_ts          | timestamp without time zone | not null default now()
>  message_is_text_only     | smallint                    | not null default 0
>  message_id               | bigint                      | not null
>  mailing_id               | bigint                      |
>  member_id                | bigint                      |
> Indexes:
>     "u_mesopens_pkey" PRIMARY KEY, btree (emma_message_id)
>     "u_mesopens_emma_message_open_ts" btree (emma_message_open_ts)
>     "u_mesopens_emma_mailing_id_idx" btree (emma_mailing_id)
>     "u_mesopens_emma_member_id_idx" btree (emma_member_id)
>
> All of the other types will follow this general style, i.e. id based
> primary key and indexes along with a timestamp somewhere in there
> recording the original entry time.  The majority of the queries we
> run against these are solely id based with some being time (or id +
> time) based, probably less than 10-20% for the latter.
>
> In order to get a ballpark idea of the current table stats I ran the
> following query against pg_class for all of these type of tables
> (this one is just for the opens):
>
> select substring(relname from '[0-9]+_(.*)$') as ttype,
>         count(relname) as total_tables,
>         sum(reltuples) as total_tuples,
>         max(reltuples) as max_tuples,
>         to_char(avg(reltuples), '999999.99') as avg_num_tuples,
>         to_char(stddev_pop(reltuples), '9999999.99') as std_dev_from_avg
> from pg_class
> where relname ~ 'opens$'
>      and substring(relname from '[0-9]+_(.*)$') is not null
> group by ttype;
>
> With the following results:
>
> -[ RECORD 1 ]----+-----------------------
> ttype            | messages_history_opens
> total_tables     | 14027
> total_tuples     |   139284528
> max_tuples       |   2760599
> avg_num_tuples   |    9929.84
> std_dev_from_avg |    59945.51
>
> Now, for this discussion, let me also point out that we've had a
> pretty steady growth rate of about 150% per year since the company
> opened 5 years ago and, for the sake of argument, we'll use that here
> although we really don't have any guarantees past personal faith that
> we can maintain that :)
>
> So, I'm looking at both a bin partitioning system or range based on
> the date timestamps and both seem to have their pros and cons.  For
> the bin example, for this table type if I set it at 300 partitions it
> will take approximately 5 years before any of the partitions reaches
> the size of our current largest opens table.  This is obviously very
> attractive from a management perspective and has the added advantage
> that I could probably expect the spread of data (volume per table) to
> be pretty even over time.

What would you base these bins on?  If you are doing it based on the
surrogate key, then you're going to spread data across both dates and
accounts into the bins, which seems like it would make the majority of
your queries not use partitions smartly.

> However, it's those times somebody wants
> to ask, "Show me all my members who have opened in the last year"
> that it becomes a problem as that data would most likely be spread
> over a radically varying number of partitions.  I could probably
> solve that by making the partition based on modulo account_id.
>

If your bins are based on account_id, that might run you the risk of
filling up the bins disproportionately.

> The other option, of course, is to go with the "standard" (my quotes
> based on what I've seen on this and the performance list) range based
> partitioning.  However, it really wouldn't make a lot of sense to
> make those partitions based on the timestamp fields as the data for
> one mailing could then span many of the partitions and the same
> question I noted in the last paragraph would again result in hitting
> potentially many (although not as many as before) partitions.

The thing to think about is going one step past constraint exclusion
(CE): which is better, an index lookup based on time across
account_id-based partitions, or an index lookup on account_id in
time-based partitions?
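
A quick way to eyeball that tradeoff is to check what constraint
exclusion actually prunes for each layout, e.g. (names, dates, and the
account_id column are illustrative, assuming time-range children with
matching CHECK constraints):

set constraint_exclusion = on;

explain
select member_id
from messages_history_opens
where message_open_ts >= '2007-10-01'
  and message_open_ts <  '2007-11-01'
  and account_id = 1234;
-- the Append node should list only the children whose CHECK ranges
-- overlap October; inside those, an index on account_id (assumed here)
-- does the rest.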

> So,
> what I'm left with here is to partition on id ranges but, in order to
> maintain a somewhat stable maximum partition size, I'd have to play a
> statistical guessing game wrt the size previous month's (or week's)
> partitions grew to v. their defined id ranges.  That's doable, but
> not very attractive.  To be honest, I'm not even convinced that
> that's something I'd really need to worry about on a month-to-month
> basis.
>

I don't think it is.  And remember, you don't have to keep your date
partitions as equal intervals.

> Seen from one perspective, partitioning on id ranges v. using the bin
> partitioning method are kind of similar.  The biggest differences
> being that with the range setup the majority of the working set of
> data will be in the last couple of month's partitions whereas with
> the bin method it will be spread pretty evenly across all of them,
> i.e. lots of pages for a couple pages constantly being worked with v.
> a couple pages from a lot of tables constantly being worked with.
>
> Any thoughts or advice?
>

I don't think you'll have much luck taking the "spread data evenly
throughout the partitions" approach; figure out how best to segment
your data into manageable chunks.  HTH.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: Re-partitioning huge schema

From: Erik Jones
Date: December 5, 2007
Robert,

Thank you once again for your input.

On Dec 5, 2007, at 3:23 PM, Robert Treat wrote:

> On Monday 03 December 2007 17:32, Erik Jones wrote:
<snip> Too much to keep quoted here.  Check the archives if you want
to read more about the setup for this conversation.

>> -[ RECORD 1 ]----+-----------------------
>> ttype            | messages_history_opens
>> total_tables     | 14027
>> total_tuples     |   139284528
>> max_tuples       |   2760599
>> avg_num_tuples   |    9929.84
>> std_dev_from_avg |    59945.51
>>
>> Now, for this discussion, let me also point out that we've had a
>> pretty steady growth rate of about 150% per year since the company
>> opened 5 years ago and, for the sake of argument, we'll use that here
>> although we really don't have any guarantees past personal faith that
>> we can maintain that :)
>>
>> So, I'm looking at both a bin partitioning system or range based on
>> the date timestamps and both seem to have their pros and cons.  For
>> the bin example, for this table type if I set it at 300 partitions it
>> will take approximately 5 years before any of the partitions reaches
>> the size of our current largest opens table.  This is obviously very
>> attractive from a management perspective and has the added advantage
>> that I could probably expect the spread of data (volume per table) to
>> be pretty even over time.
>
> What would you base these bins on?  If you are doing it based on the
> surrogate key, then you're going to spread data across both dates and
> accounts into the bins, which seems like it would make the majority of
> your queries not use partitions smartly.

Given that these are the userdata tables, we don't run any queries
without there being an account context.  We run queries that answer
questions like: "How many opens did this mailing have?" or "Show me
all members in this account who have opened in the last six
months...".  We also keep a system-wide summary totals table that
allows us to run global system stats queries.

>> However, it's those times somebody wants
>> to ask, "Show me all my members who have opened in the last year"
>> that it becomes a problem as that data would most likely be spread
>> over a radically varying number of partitions.  I could probably
>> solve that by making the partition based on modulo account_id.
>>
>
> If your bins are based on account_id, that might run you the risk of
> filling up the bins disproportionately.

While this is a possibility, there are ways to manage it.  The best
way is to use a check constraint along the lines of:

CHECK (account_id % 100 = 1 AND account_id <> ALL (some_func()))

where some_func() returns an array of the accounts that are large
enough to have their own partitions.  For example, with these opens
tables we have only 32 accounts out of over 14,000 with a tuple count
over 500,000.  Filtering those out of the same query I used before
gives:

-[ RECORD 1 ]----+-----------------------
ttype            | messages_history_opens
total_tables     | 14072
total_tuples     |   110349736
max_tuples       |    495600
avg_num_tuples   |    7841.83
std_dev_from_avg |    30860.06

So, after spreading about 14K accounts over 100 bins with those
statistics, their sizes will be fairly even.  Whenever we have an
account grow disproportionately large we can separate it into its own
partition with a simple check constraint of account_id = dddd and make
sure that dddd is now returned in the array from some_func().
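
For illustration, some_func() could be as simple as this (the name is
just my placeholder from above and the ids are made up; it's declared
immutable so the planner can fold it to a constant, with the caveat
that changing the list later means re-validating the bins that
reference it):

create or replace function some_func() returns bigint[] as $$
    select array[1021, 1789, 2344]::bigint[];  -- the oversized accounts
$$ language sql immutable;

-- and a matching dedicated partition for one of those accounts:
create table messages_history_opens_account_1021 (
    check (account_id = 1021)
) inherits (messages_history_opens);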

Actually, I'd be splitting accounts out based on their audience and
mailing sizes as that's what will determine their growth rates.

>> The other option, of course, is to go with the "standard" (my quotes
>> based on what I've seen on this and the performance list) range based
>> partitioning.  However, it really wouldn't make a lot of sense to
>> make those partitions based on the timestamp fields as the data for
>> one mailing could then span many of the partitions and the same
>> question I noted in the last paragraph would again result in hitting
>> potentially many (although not as many as before) partitions.
>
> The thing to think about is going one step past constraint exclusion
> (CE): which is better, an index lookup based on time across
> account_id-based partitions, or an index lookup on account_id in
> time-based partitions?

See my answer above about our summary table.

>> So,
>> what I'm left with here is to partition on id ranges but, in order to
>> maintain a somewhat stable maximum partition size, I'd have to play a
>> statistical guessing game wrt the size previous month's (or week's)
>> partitions grew to v. their defined id ranges.  That's doable, but
>> not very attractive.  To be honest, I'm not even convinced that
>> that's something I'd really need to worry about on a month-to-month
>> basis.
>>
>
> I don't think it is.  And remember, you don't have to keep your date
> partitions as equal intervals.
>
>> Seen from one perspective, partitioning on id ranges v. using the bin
>> partitioning method are kind of similar.  The biggest differences
>> being that with the range setup the majority of the working set of
>> data will be in the last couple of month's partitions whereas with
>> the bin method it will be spread pretty evenly across all of them,
>> i.e. lots of pages for a couple pages constantly being worked with v.
>> a couple pages from a lot of tables constantly being worked with.
>>
>> Any thoughts or advice?
>>
>
> I don't think you'll have much luck taking the "spread data evenly
> throughout the partitions" approach; figure out how best to segment
> your data into manageable chunks.  HTH.

I agree.  That's also why I'm not too worried about taking a bin
based approach to the partitioning.  In addition, once this is done,
however I end up doing it, I'm going to be working on a horizontal
systems scaling approach wherein I set up another database, replicate
the global tables (all 132 of them) to it, and split the accounts'
userdata between the two databases -- fun, fun, fun!

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



Re: Re-partitioning huge schema

From: Robert Treat
Date: December 7, 2007
On Wednesday 05 December 2007 17:29, Erik Jones wrote:
> > I don't think you'll have much luck taking the "spread data evenly
> > throughout the partitions" approach; figure out how best to segment
> > your data into manageable chunks.  HTH.
>
> I agree.  That's also why I'm not too worried about taking a bin
> based approach to the partitioning.  In addition, once this is done,
> however I end up doing it, I'm going to be working on a horizontal
> systems scaling approach wherein I set up another database, replicate
> the global tables (all 132 of them) to it, and split the accounts'
> userdata between the two databases -- fun, fun, fun!
>

A few of us have been playing with using plproxy in partitioning
triggers to also achieve horizontal scaling at the same time; you might
want to look into that.  (Personally I've not yet landed a customer who
needs such a solution, but hopefully soon.)
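
The proxy side ends up being just a thin function, something like this
(very rough sketch: 'userdata' is a made-up cluster name that would
have to be set up via the usual plproxy configuration functions, and a
regular function with the same name and signature has to exist on each
node to do the real work):

create function opens_for_mailing(i_account_id bigint, i_mailing_id bigint)
returns setof bigint as $$
    CLUSTER 'userdata';
    RUN ON hashtext(i_account_id::text);
$$ language plproxy;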

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: Re-partitioning huge schema

From: Erik Jones
Date:
On Dec 7, 2007, at 11:53 AM, Robert Treat wrote:

> On Wednesday 05 December 2007 17:29, Erik Jones wrote:
>>> I don't think you'll have much luck taking the "spread data evenly
>>> throughout the partitions" approach; figure out how best to segment
>>> your data into manageable chunks.  HTH.
>>
>> I agree.  That's also why I'm not too worried about taking a bin
>> based approach to the partitioning.  In addition, once this is done,
>> however I end up doing it, I'm going to be working on a horizontal
>> systems scaling approach wherein I set up another database, replicate
>> the global tables (all 132 of them) to it, and split the accounts'
>> userdata between the two databases -- fun, fun, fun!
>>
>
> A few of us have been playing with using plproxy in partitioning
> triggers to also achieve horizontal scaling at the same time; you
> might want to look into that.  (Personally I've not yet landed a
> customer who needs such a solution, but hopefully soon.)

Yeah, when I get to that point I'll definitely be testing with that,
as it will make things simpler from the application's perspective.
First I have to fix our current schema, though :(

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



Re: Re-partitioning huge schema

From: "Joshua D. Drake"
Date:

On Fri, 07 Dec 2007 12:53:45 -0500
Robert Treat <xzilla@users.sourceforge.net> wrote:

> On Wednesday 05 December 2007 17:29, Erik Jones wrote:
> > > I don't think you'll have much luck taking the "spread data evenly
> > > throughout the partitions" approach; figure out how best to segment
> > > your data into manageable chunks.  HTH.
> >
> > I agree.  That's also why I'm not too worried about taking a bin
> > based approach to the partitioning.  In addition, once this is done,
> > however I end up doing it, I'm going to be working on a horizontal
> > systems scaling approach wherein I set up another database,
> > replicate the global tables (all 132 of them) to it, and split the
> > accounts' userdata between the two databases -- fun, fun, fun!
> >
> 
> A few of us have been playing with using plproxy in partitioning
> triggers to also achieve horizontal scaling at the same time; you
> might want to look into that.  (Personally I've not yet landed a
> customer who needs such a solution, but hopefully soon.)
> 

I am in the process of a couple of prototype deployments for this.  It
is very, very cool.  I particularly like the ability to define which
nodes particular functions will be run on, allowing you to have a
centralized master as your connection manager that can interact with
any number of nodes based on the function call.
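
For example (sketch only, with a made-up cluster name), you can fan a
call out to every node or pin it to a single one straight from the
RUN ON clause:

-- gather results from every node in the cluster:
create function global_open_totals()
returns setof bigint as $$
    CLUSTER 'userdata';
    RUN ON ALL;
$$ language plproxy;

-- always execute on one fixed node (node 0 here):
create function node_zero_report()
returns setof text as $$
    CLUSTER 'userdata';
    RUN ON 0;
$$ language plproxy;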

Sincerely,


Joshua D. Drake


--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
            UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
