Thread: Thousands of tables versus one table?

Thousands of tables versus one table?

From
Thomas Andrews
Date:
I have several thousand clients.  Our clients do surveys, and each survey
has two tables for the client data,

   responders
   responses

Frequent inserts into both tables.

Right now, we are seeing significant time during inserts to these two
tables.

Some of the indices on responders and responses do not index on the client ID
first.

So, we are considering two possible solutions.

 (1) Create separate responders and responses tables for each client.

 (2) Make sure all indices on responders and responses start with the
  client id (excepting, possibly, the primary keys for these fields) and
  have all normal operation queries always include an id_client.

Right now, for example, given a responder and a survey question, we do a
query in responses by the id_responder and id_survey.  This gives us a
unique record, but I'm wondering if maintaining the index on
(id_responder,id_survey) is more costly on inserts than maintaining the
index (id_client,id_responder,id_survey) given that we also have other
indices on (id_client,...).
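
For concreteness, a minimal sketch of the two index layouts being weighed;
the index names are invented, and the columns are the ones named above:

    -- current lookup index, without the client id
    CREATE INDEX responses_resp_survey_idx
        ON responses (id_responder, id_survey);

    -- option (2): lead with the client id and always filter on it
    CREATE INDEX responses_client_resp_survey_idx
        ON responses (id_client, id_responder, id_survey);

    -- normal-operation queries would then always look like:
    -- SELECT ... FROM responses
    --  WHERE id_client = ... AND id_responder = ... AND id_survey = ...;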

Option (1) makes me very nervous.  I don't like the idea of the same sorts
of data being stored in lots of different tables, in part for long-term
maintenance reasons.  We don't really need cross-client reporting, however.

=thomas


Re: Thousands of tables versus one table?

From
Mark Lewis
Date:
On Mon, 2007-06-04 at 13:40 -0400, Thomas Andrews wrote:
> I have several thousand clients.  Our clients do surveys, and each survey
> has two tables for the client data,
>
>    responders
>    responses
>
> Frequent inserts into both tables.
>
> Right now, we are seeing significant time during inserts to these two
> tables.

Can you provide some concrete numbers here?  Perhaps an EXPLAIN ANALYZE
for the insert, sizes of tables, stuff like that?
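
One way to capture that for a single insert without keeping the test row;
the id columns are the ones mentioned in the thread, everything else is a
placeholder:

    BEGIN;
    EXPLAIN ANALYZE
        INSERT INTO responses (id_client, id_responder, id_survey, answer)
        VALUES (1, 2, 3, 'test');   -- placeholder column and values
    ROLLBACK;  -- EXPLAIN ANALYZE really executes the INSERT, so roll it back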

> Some of the indices on responders and responses do not index on the client ID
> first.

What reason do you have to think that this matters?

> So, we are considering two possible solutions.
>
>  (1) Create separate responders and responses tables for each client.
>
>  (2) Make sure all indices on responders and responses start with the
>   client id (excepting, possibly, the primary keys for these fields) and
>   have all normal operation queries always include an id_client.
>
> Right now, for example, given a responder and a survey question, we do a
> query in responses by the id_responder and id_survey.  This gives us a
> unique record, but I'm wondering if maintaining the index on
> (id_responder,id_survey) is more costly on inserts than maintaining the
> index (id_client,id_responder,id_survey) given that we also have other
> indices on (id_client,...).
>
> Option (1) makes me very nervous.  I don't like the idea of the same sorts
> of data being stored in lots of different tables, in part for long-term
> maintenance reasons.  We don't really need cross-client reporting, however.

What version of PG is this?  What is your vacuuming strategy?  Have you
tried a REINDEX to see if that helps?
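
For reference, those checks are plain commands; note that REINDEX takes an
exclusive lock on the table while it runs:

    SELECT version();         -- server version
    REINDEX TABLE responders;
    REINDEX TABLE responses;  -- rebuilds every index on the table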

-- Mark Lewis

Re: Thousands of tables versus one table?

From
Thomas Andrews
Date:
Oh, and we vacuum every day.  Not sure about REINDEX, but I doubt we
have done that.

=thomas

Mark Lewis wrote:
> What version of PG is this?  What is your vacuuming strategy?  Have you
> tried a REINDEX to see if that helps?



Re: Thousands of tables versus one table?

From
Thomas Andrews
Date:
We're running 7.4 but will be upgrading to 8.2.

The responses table has 20,000,000 records.

Sometimes (but not all the time) an insert into the responses table can
take 5-6 seconds.

I guess my real question is, does it ever make sense to create thousands
of tables like this?

=thomas

Mark Lewis wrote:
> Can you provide some concrete numbers here?  Perhaps an EXPLAIN ANALYZE
> for the insert, sizes of tables, stuff like that?
>
> What version of PG is this?  What is your vacuuming strategy?  Have you
> tried a REINDEX to see if that helps?



Re: Thousands of tables versus one table?

From
Gregory Stark
Date:
"Thomas Andrews" <tandrews@soliantconsulting.com> writes:

> I guess my real question is, does it ever make sense to create thousands of
> tables like this?

Sometimes. But usually it's not a good idea.

What you're proposing is basically partitioning, though you may not actually
need to put all the partitions together for your purposes. Partitioning's main
benefit is in the management of the data. You can drop and load partitions in
chunks rather than have to perform large operations on millions of records.

Postgres doesn't really get any faster by breaking the tables up like that. In
fact it probably gets slower as it has to look up which of the thousands of
tables you want to work with.

How often do you update or delete records and how many do you update or
delete? Once per day is a very low frequency for vacuuming a busy table, you
may be suffering from table bloat. But if you never delete or update records
then that's irrelevant.

Does reindexing or clustering the table make a marked difference?

I would suggest you post your schema and the results of "vacuum verbose".
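
As a sketch of those commands; the index name handed to CLUSTER is assumed,
and CLUSTER locks and rewrites the whole table while it runs (pre-8.3 syntax
takes the index name first):

    VACUUM VERBOSE responses;
    -- the output reports, per table and index, how many dead row versions
    -- were found and how many pages they occupied

    -- rewrite the table in the order of an existing index
    CLUSTER responses_client_idx ON responses;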

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Thousands of tables versus one table?

From
Thomas Andrews
Date:


On 6/4/07 3:43 PM, "Gregory Stark" <stark@enterprisedb.com> wrote:

>
> "Thomas Andrews" <tandrews@soliantconsulting.com> writes:
>
>> I guess my real question is, does it ever make sense to create thousands of
>> tables like this?
>
> Sometimes. But usually it's not a good idea.
>
> What you're proposing is basically partitioning, though you may not actually
> need to put all the partitions together for your purposes. Partitioning's main
> benefit is in the management of the data. You can drop and load partitions in
> chunks rather than have to perform large operations on millions of records.
>
> Postgres doesn't really get any faster by breaking the tables up like that. In
> fact it probably gets slower as it has to look up which of the thousands of
> tables you want to work with.
>
> How often do you update or delete records and how many do you update or
> delete? Once per day is a very low frequency for vacuuming a busy table, you
> may be suffering from table bloat. But if you never delete or update records
> then that's irrelevant.

It looks like the most inserts that have occurred in a day is about 2000.
The responders table has 1.3 million records, the responses table has 50
million records.  Most of the inserts are in the responses table.

>
> Does reindexing or clustering the table make a marked difference?
>

Clustering sounds like it might be a really good solution.  How long does a
cluster command usually take on a table with 50,000,000 records?  Is it
something that can be run daily/weekly?

I'd rather not post the schema because it's not mine - I'm a consultant.  I
can tell you our vacuum every night is taking 2 hours and that disk IO is
the real killer - the CPU rarely gets higher than 20% or so.

=thomas


Re: Thousands of tables versus one table?

From
"Y Sidhu"
Date:
On 6/4/07, Thomas Andrews <tandrews@soliantconsulting.com> wrote:

> I'd rather not post the schema because it's not mine - I'm a consultant.  I
> can tell you our vacuum every night is taking 2 hours and that disk IO is
> the real killer - the CPU rarely gets higher than 20% or so.


What OS are you running on?


--
Yudhvir Singh Sidhu
408 375 3134 cell

Re: Thousands of tables versus one table?

From
Thomas Andrews
Date:
Linux 2.4.9, if I’m reading this right.

=thomas


On 6/4/07 4:08 PM, "Y Sidhu" <ysidhu@gmail.com> wrote:

> What OS are you running on?


Re: Thousands of tables versus one table?

From
PFC
Date:

> can tell you our vacuum every night is taking 2 hours and that disk IO is
> the real killer - the CPU rarely gets higher than 20% or so.

    How many gigabytes of stuff do you have in this database?
    (du -sh on the *right* directory will suffice; don't include the logs
etc., aim for data/base/oid)
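
A way to find the directory PFC means, as a sketch (the du line is run from
a shell on the database server):

    -- the per-database directory under $PGDATA/base/ is named after this oid
    SELECT oid, datname FROM pg_database WHERE datname = current_database();
    -- then, from the shell:  du -sh $PGDATA/base/<that oid>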


Re: Thousands of tables versus one table?

From
Gregory Stark
Date:
"Thomas Andrews" <tandrews@soliantconsulting.com> writes:

> Clustering sounds like it might be a really good solution.  How long does a
> cluster command usually take on a table with 50,000,000 records?  Is it
> something that can be run daily/weekly?

ouch, ok, with 50M records cluster isn't going to be quick either, especially
if you have a lot of indexes.

With those kinds of numbers and with the kind of workload you're describing
where you have different areas that are really completely separate, you might
consider partitioning the table. That's essentially what you're proposing
anyways.

Honestly table partitioning in Postgres is pretty young and primitive and if
you have the flexibility in your application to refer to different tables
without embedding them throughout your application then you might consider
that. But there are also advantages to being able to select from all the
tables together using the partitioned table.

> I'd rather not post the schema because it's not mine - I'm a consultant.  I
> can tell you our vacuum every night is taking 2 hours and that disk IO is
> the real killer - the CPU rarely gets higher than 20% or so.

Do you ever update or delete these records? If you never update or delete
records then the vacuum is mostly a waste of effort anyways. (You still have
to vacuum occasionally to prevent xid wraparound but that's much much less
often).

If you do delete records in large batches or have lots of updates then
vacuuming daily with default fsm settings probably isn't enough.
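
The fsm settings in question can be checked from psql; a sketch (these are
8.x-era parameters, and the defaults are fairly small):

    SHOW max_fsm_pages;      -- how many pages of freed space can be tracked
    SHOW max_fsm_relations;  -- how many tables and indexes can be tracked
    -- a database-wide VACUUM VERBOSE reports at the end how much of the
    -- free space map is actually needed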

How many indexes do you have?

And if they don't all have client_id in their prefix then I wonder about the
plans you're getting. It's unfortunate you can't post your schema and query
plans. It's possible you have some plans that are processing many more records
than they need to in order to do their work, because they're using indexes
or combinations of indexes that aren't specific enough.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Thousands of tables versus one table?

From
Scott Marlowe
Date:
Gregory Stark wrote:
> "Thomas Andrews" <tandrews@soliantconsulting.com> writes:
>
>
>> I guess my real question is, does it ever make sense to create thousands of
>> tables like this?
>>
>
> Sometimes. But usually it's not a good idea.
>
> What you're proposing is basically partitioning, though you may not actually
> need to put all the partitions together for your purposes. Partitioning's main
> benefit is in the management of the data. You can drop and load partitions in
> chunks rather than have to perform large operations on millions of records.
>
> Postgres doesn't really get any faster by breaking the tables up like that. In
> fact it probably gets slower as it has to look up which of the thousands of
> tables you want to work with.
>

That's not entirely true.  PostgreSQL can be markedly faster using
partitioning as long as you always access it by referencing the
partitioning key in the where clause.  So, if you partition the table by
date, and always reference it with a date in the where clause, it will
usually be noticeably faster.  OTOH, if you access it without using a
where clause that lets it pick partitions, then it will be slower than
one big table.

So, while this poster might originally think to have one table for each
user, resulting in thousands of tables, maybe a compromise where you
partition on userid ranges would work out well, and keep each partition
table down to some 50-100 thousand rows, with smaller indexes to match.
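
In the 8.x releases being discussed, that kind of partitioning is assembled
by hand from table inheritance and CHECK constraints.  A minimal sketch of
the client-id-range idea, with invented range boundaries and only the
columns named earlier in the thread:

    CREATE TABLE responses (
        id_client    integer NOT NULL,
        id_responder integer NOT NULL,
        id_survey    integer NOT NULL
        -- ... the actual response columns ...
    );

    -- one child table per range of client ids
    CREATE TABLE responses_c0000 (
        CHECK (id_client >= 0 AND id_client < 1000)
    ) INHERITS (responses);

    CREATE TABLE responses_c1000 (
        CHECK (id_client >= 1000 AND id_client < 2000)
    ) INHERITS (responses);

    -- each child gets its own, much smaller index
    CREATE INDEX responses_c0000_idx
        ON responses_c0000 (id_client, id_responder, id_survey);
    CREATE INDEX responses_c1000_idx
        ON responses_c1000 (id_client, id_responder, id_survey);

With constraint_exclusion enabled, a query whose WHERE clause pins down
id_client only has to touch the matching child table.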

Re: Thousands of tables versus one table?

From
david@lang.hm
Date:
On Mon, 4 Jun 2007, Scott Marlowe wrote:

> Gregory Stark wrote:
>>  Postgres doesn't really get any faster by breaking the tables up like
>>  that. In fact it probably gets slower as it has to look up which of the
>>  thousands of tables you want to work with.
>
> That's not entirely true.  PostgreSQL can be markedly faster using
> partitioning as long as you always access it by referencing the partitioning
> key in the where clause.  So, if you partition the table by date, and always
> reference it with a date in the where clause, it will usually be noticeably
> faster.  OTOH, if you access it without using a where clause that lets it
> pick partitions, then it will be slower than one big table.
>
> So, while this poster might originally think to have one table for each user,
> resulting in thousands of tables, maybe a compromise where you partition on
> userid ranges would work out well, and keep each partition table down to some
> 50-100 thousand rows, with smaller indexes to match.
>

what if he doesn't use the postgres internal partitioning, but instead
makes his code access the tables named responsesNNNNN where NNNNN is the
id of the customer?

this is what it sounded like he was asking initially.

David Lang

Re: Thousands of tables versus one table?

From
Scott Marlowe
Date:
david@lang.hm wrote:
> what if he doesn't use the postgres internal partitioning, but instead
> makes his code access the tables named responsesNNNNN where NNNNN is
> the id of the customer?
>
> this is what it sounded like he was asking initially.

Sorry, I think I initially read your response as "Postgres doesn't
really get any faster by breaking the tables up" without the "like that"
part.

I've found that as long as the number of tables is > 10,000 or so,
having a lot of tables doesn't seem to really slow pgsql down a lot.
I'm sure that the tipping point is dependent on your db machine.  I
would bet that if he's referring to individual tables directly, and each
one has hundreds instead of millions of rows, the performance would be
better.  But the only way to be sure is to test it.

Re: Thousands of tables versus one table?

From
Thomas Andrews
Date:
So, partitioning in PSQL 8 is workable, but breaking up the table up into
actual separate tables is not?

Another solution we have proposed is having 'active' and 'completed' tables.
So, rather than thousands, we'd have four tables:

   responders_active
   responders_completed
   responses_active
   responses_completed

That way, the number of responses_active records would not be as huge.  The
problem, as we see it, is that the responders are entering their responses
and it is taking too long.  But if we separate out active and  completed
surveys, then the inserts will likely cost less.  We might even be able to
reduce the indices on the _active tables because survey administrators would
not want to run as many complex reports on the active responses.

There would be an extra cost, when the survey is completed, of copying the
records from the '_active' table to the '_completed' table and then deleting
them, but that operation is something a survey administrator would be
willing to accept as taking a while (as well as something we could put off
to an off hour, although we have lots of international customers so it's not
clear when our off hours are.)
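
The copy-then-delete step could be done as one transaction; a sketch,
assuming both tables carry an id_survey column and using 42 as a stand-in
for the survey being closed out:

    BEGIN;
    INSERT INTO responses_completed
        SELECT * FROM responses_active WHERE id_survey = 42;
    DELETE FROM responses_active WHERE id_survey = 42;

    INSERT INTO responders_completed
        SELECT * FROM responders_active WHERE id_survey = 42;
    DELETE FROM responders_active WHERE id_survey = 42;
    COMMIT;
    -- the deletes leave dead rows behind, so the _active tables would need
    -- a vacuum soon afterwards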

=thomas


On 6/5/07 12:48 PM, "Scott Marlowe" <smarlowe@g2switchworks.com> wrote:

> Sorry, I think I initially read your response as "Postgres doesn't
> really get any faster by breaking the tables up" without the "like that"
> part.
>
> I've found that as long as the number of tables is > 10,000 or so,
> having a lot of tables doesn't seem to really slow pgsql down a lot.
> I'm sure that the tipping point is dependent on your db machine.  I
> would bet that if he's referring to individual tables directly, and each
> one has hundreds instead of millions of rows, the performance would be
> better.  But the only way to be sure is to test it.


Re: Thousands of tables versus one table?

From
Scott Marlowe
Date:
Thomas Andrews wrote:
>
>
> On 6/5/07 12:48 PM, "Scott Marlowe" <smarlowe@g2switchworks.com> wrote:
>> I've found that as long as the number of tables is > 10,000 or so,
>>
That should have been as long as the number of tables is < 10,000 or so...

>> having a lot of tables doesn't seem to really slow pgsql down a lot.
>> I'm sure that the tipping point is dependent on your db machine.  I
>> would bet that if he's referring to individual tables directly, and each
>> one has hundreds instead of millions of rows, the performance would be
>> better.  But the only way to be sure is to test it.
>>
>
>
Please stop top posting.  This is my last reply until you stop top posting.

> So, partitioning in PSQL 8 is workable, but breaking up the table up into
> actual separate tables is not?
>
Ummm, that's not what I said.  They're similar in execution.  However,
partitioning might let you put 100 customers into a given table, if,
say, you partitioned on customer ID or something that would allow you to
group a few together.
> Another solution we have proposed is having 'active' and 'completed' tables.
> So, rather than thousands, we'd have four tables:
>
>    responders_active
>    responders_completed
>    responses_active
>    responses_completed
>
That's not a bad idea.  Just keep up on your vacuuming.

Re: Thousands of tables versus one table?

From
Gregory Stark
Date:
"Scott Marlowe" <smarlowe@g2switchworks.com> writes:

> Sorry, I think I initially read your response as "Postgres doesn't really get
> any faster by breaking the tables up" without the "like that" part.

Well breaking up the tables like that or partitioning, either way should be
about equivalent really. Breaking up the tables and doing it in the
application should perform even better but it does make the schema less
flexible and harder to do non-partition based queries and so on.

I guess I should explain what I originally meant: A lot of people come from a
flat-file world and assume that things get slower when you deal with large
tables. In fact due to the magic of log(n) accessing records from a large
index is faster than first looking up the table and index info in a small
index and then doing a second lookup in an index for a table half the
size.

Where the win in partitioning comes in is in being able to disappear some of
the data entirely. By making part of the index key implicit in the choice of
partition you get away with a key that's half as large. And in some cases you
can get away with using a different key entirely which wouldn't otherwise have
been feasible to index. In some cases you can even do sequential scans whereas
in an unpartitioned table you would have to use an index (or scan the entire
table).
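
In the per-client layout floated earlier in the thread, that implicit key
would look roughly like this (table and index names are hypothetical):

    -- one big table: the client id has to be part of every key
    CREATE INDEX responses_idx
        ON responses (id_client, id_responder, id_survey);

    -- per-client table: the client id is implied by the table itself,
    -- so each index (and every entry in it) is one column shorter
    CREATE INDEX responses_00042_idx
        ON responses_00042 (id_responder, id_survey);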

But the real reason people partition data is really for the management ease.
Being able to drop and load entire partitions in O(1) makes it feasible to
manage data on a scale that would simply be impossible without partitioned
tables.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Thousands of tables versus one table?

From
Scott Marlowe
Date:
Gregory Stark wrote:
> "Scott Marlowe" <smarlowe@g2switchworks.com> writes:
>
>
>> Sorry, I think I initially read your response as "Postgres doesn't really get
>> any faster by breaking the tables up" without the "like that" part.
>>
>
> Well breaking up the tables like that or partitioning, either way should be
> about equivalent really. Breaking up the tables and doing it in the
> application should perform even better but it does make the schema less
> flexible and harder to do non-partition based queries and so on.
>
True, but we can break it up by something other than the company name on
the survey, in this instance, and might find it far easier to manage by,
say, date range, company ID range, etc...
Plus, with a few hand-rolled bash or perl scripts we can maintain our
database and keep all the logic of partitioning out of our app, which
would allow developers not wholly conversant in our partitioning scheme
to participate in development without the fear of them putting data in
the wrong place.
> Where the win in partitioning comes in is in being able to disappear some of
> the data entirely. By making part of the index key implicit in the choice of
> partition you get away with a key that's half as large. And in some cases you
> can get away with using a different key entirely which wouldn't otherwise have
> been feasible to index. In some cases you can even do sequential scans whereas
> in an unpartitioned table you would have to use an index (or scan the entire
> table).
>
Yeah, I found that out recently while benchmarking a 12,000,000 row
geometric data set.  Breaking it into 400 or so partitions resulted in
no need for indexes and response times of 0.2 or so seconds, where
before that I'd been in the 1.5 to 3 second range.

Re: Thousands of tables versus one table?

From
david@lang.hm
Date:
On Tue, 5 Jun 2007, Gregory Stark wrote:

> "Scott Marlowe" <smarlowe@g2switchworks.com> writes:
>
>> Sorry, I think I initially read your response as "Postgres doesn't really get
>> any faster by breaking the tables up" without the "like that" part.
>
> Well breaking up the tables like that or partitioning, either way should be
> about equivalent really. Breaking up the tables and doing it in the
> application should perform even better but it does make the schema less
> flexible and harder to do non-partition based queries and so on.

but he said in the initial message that they don't do cross-customer
reports anyway, so there really isn't any non-partition based querying
going on.

> I guess I should explain what I originally meant: A lot of people come from a
> flat-file world and assume that things get slower when you deal with large
> tables. In fact due to the magic of log(n) accessing records from a large
> index is faster than first looking up the table and index info in a small
> index and then doing a second lookup in an index for a table half the
> size.

however, if your query plan ever does a sequential scan of a table then
you are not doing a log(n) lookup, are you?

> Where the win in partitioning comes in is in being able to disappear some of
> the data entirely. By making part of the index key implicit in the choice of
> partition you get away with a key that's half as large. And in some cases you
> can get away with using a different key entirely which wouldn't otherwise have
> been feasible to index. In some cases you can even do sequential scans whereas
> in an unpartitioned table you would have to use an index (or scan the entire
> table).
>
> But the real reason people partition data is really for the management ease.
> Being able to drop and load entire partitions in O(1) makes it feasible to
> manage data on a scale that would simply be impossible without partitioned
> tables.

remember that the original question wasn't about partitioned tables, it
was about the performance problem he was having with one large table (slow
insert speed) and asking if postgres would collapse if he changed his
schema to use a separate table per customer.

I see many cases where people advocate collapsing databases/tables
together by adding a column that indicates which customer the line is
for.

however I really don't understand why it is more efficient to have a 5B
line table that you do a report/query against 0.1% of than it is to have
1000 different tables of 5M lines each and do a report/query against 100%
of. it would seem that having to skip over 99.9% of the data to find the
things that _may_ be relevant would have a noticeable cost in and of
itself.

David Lang

Re: Thousands of tables versus one table?

From
Tom Lane
Date:
david@lang.hm writes:
> however I really don't understand why it is more efficient to have a 5B
> line table that you do a report/query against 0.1% of than it is to have
> 1000 different tables of 5M lines each and do a report/query against 100%
> of.

Essentially what you are doing when you do that is taking the top few
levels of the index out of the database and putting it into the
filesystem; plus creating duplicative indexing information in the
database's system catalogs.

The degree to which this is a win is *highly* debatable, and certainly
depends on a whole lot of assumptions about filesystem performance.
You also need to assume that constraint-exclusion in the planner is
pretty doggone cheap relative to the table searches, which means it
almost certainly will lose badly if you carry the subdivision out to
the extent that the individual tables become small.  (This last could
be improved in some cases if we had a more explicit representation of
partitioning, but it'll never be as cheap as one more level of index
search.)

I think the main argument for partitioning is when you are interested in
being able to drop whole partitions cheaply.

            regards, tom lane

Re: Thousands of tables versus one table?

From
"Steinar H. Gunderson"
Date:
On Tue, Jun 05, 2007 at 05:59:25PM -0400, Tom Lane wrote:
> I think the main argument for partitioning is when you are interested in
> being able to drop whole partitions cheaply.

Wasn't there also talk about adding the ability to mark individual partitions
as read-only, thus bypassing MVCC and allowing queries to be satisfied using
indexes only?

Not that I think I've seen it on the TODO... :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Thousands of tables versus one table?

From
david@lang.hm
Date:
On Tue, 5 Jun 2007, Tom Lane wrote:

> david@lang.hm writes:
>> however I really don't understand why it is more efficient to have a 5B
>> line table that you do a report/query against 0.1% of than it is to have
>> 1000 different tables of 5M lines each and do a report/query against 100%
>> of.
>
> Essentially what you are doing when you do that is taking the top few
> levels of the index out of the database and putting it into the
> filesystem; plus creating duplicative indexing information in the
> database's system catalogs.
>
> The degree to which this is a win is *highly* debatable, and certainly
> depends on a whole lot of assumptions about filesystem performance.
> You also need to assume that constraint-exclusion in the planner is
> pretty doggone cheap relative to the table searches, which means it
> almost certainly will lose badly if you carry the subdivision out to
> the extent that the individual tables become small.  (This last could
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
what is considered 'small'? a few thousand records, a few million records?

what multiplication factor would there need to be on the partitioning to
make it worth while? 100 tables, 1000 tables, 10000 tables?

the company that I'm at started out with a separate database per customer
(not using postgres), there are basically zero cross-customer queries, with
a large volume of updates and lookups.

overall things have now grown to millions of updates/day (some multiple of
this in lookups), and ~2000 customers, with tens of millions of rows
between them.

having each one as a separate database has really helped us over the years
as it's made it easy to scale (run 500 databases on each server instead of
1000, performance just doubled)

various people (not database experts) are pushing to install Oracle
cluster so that they can move all of these to one table with a customerID
column.

the database folks won't comment much on this either way, but they don't
seem enthusiastic about combining all the data together.

I've been on the side of things that said that separate databases are
better because it improves data locality to only have to look at the data
for one customer at a time rather than having to pick that customer's
data out from the mass of other, unrelated data.

> be improved in some cases if we had a more explicit representation of
> partitioning, but it'll never be as cheap as one more level of index
> search.)

say you have a billing table of
customerID, date, description, amount, tax, extended, paid

and you need to do things like
report on invoices that haven't been paid
summarize the amount billed each month
summarize the tax for each month

but you need to do this separately for each customerID (not as a batch job
that reports on all customerID's at once, think a website where the
customer can request such reports at any time with a large variation in
criteria)

would you be able to just have one index on customerID and then another on
date? or would the second one need to be on customerID||date?

and would this process of going through the index and seeking to the data
it points to really be faster than a sequential scan of just the data
related to that customerID?
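
As a sketch of the indexing question (column types are guesses), a single
composite index on (customerID, date) serves both the per-customer and the
per-customer-per-month reports, so a combined customerID||date column
shouldn't be needed:

    CREATE TABLE billing (
        customerID  integer,
        date        date,
        description text,
        amount      numeric,
        tax         numeric,
        extended    numeric,
        paid        boolean
    );

    CREATE INDEX billing_cust_date_idx ON billing (customerID, date);

    -- invoices that haven't been paid, for one customer
    SELECT * FROM billing WHERE customerID = 42 AND NOT paid;

    -- amount billed and tax per month, for one customer
    SELECT date_trunc('month', date) AS month,
           sum(amount) AS billed,
           sum(tax)    AS total_tax
      FROM billing
     WHERE customerID = 42
     GROUP BY 1
     ORDER BY 1;

Whether those index lookups beat a sequential scan of just that customer's
data is exactly the question above; the planner's answer depends on how
many of the customer's rows each report touches.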

> I think the main argument for partitioning is when you are interested in
> being able to drop whole partitions cheaply.

I fully understand this if you are doing queries across all the
partitions, but if your query is confined to a single partition,
especially in the case where you know ahead of time in the application
which 'partition' you care about, it would seem that searching through
significantly less data should be a win.

David Lang

Re: Thousands of tables versus one table?

From
david@lang.hm
Date:
On Wed, 6 Jun 2007, Steinar H. Gunderson wrote:

> On Tue, Jun 05, 2007 at 05:59:25PM -0400, Tom Lane wrote:
>> I think the main argument for partitioning is when you are interested in
>> being able to drop whole partitions cheaply.
>
> Wasn't there also talk about adding the ability to mark individual partitions
> as read-only, thus bypassing MVCC and allowing queries to be satisfied using
> indexes only?
>
> Not that I think I've seen it on the TODO... :-)

now that's a very interesting idea, especially when combined with
time-based data where the old times will never change.

David Lang

Re: Thousands of tables versus one table?

From
Heikki Linnakangas
Date:
david@lang.hm wrote:
> On Wed, 6 Jun 2007, Steinar H. Gunderson wrote:
>
>> On Tue, Jun 05, 2007 at 05:59:25PM -0400, Tom Lane wrote:
>>> I think the main argument for partitioning is when you are interested in
>>> being able to drop whole partitions cheaply.
>>
>> Wasn't there also talk about adding the ability to mark individual
>> partitions
>> as read-only, thus bypassing MVCC and allowing queries to be satisfied
>> using
>> indexes only?
>>
>> Not that I think I've seen it on the TODO... :-)
>
> now that's a very interesting idea, especially when combined with
> time-based data where the old times will never change.

That's been discussed, but it's controversial. IMHO a better way to
achieve that is to design the dead-space-map so that it can be used to
check which parts of a table are visible to everyone, and skip
visibility checks. That doesn't require any user action, and allows updates.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Thousands of tables versus one table?

From
Scott Marlowe
Date:
david@lang.hm wrote:
> On Tue, 5 Jun 2007, Tom Lane wrote:
>
>> david@lang.hm writes:
>>> however I really don't understand why it is more efficient to have a 5B
>>> line table that you do a report/query against 0.1% of than it is to have
>>> 1000 different tables of 5M lines each and do a report/query against 100%
>>> of.
>>
>> Essentially what you are doing when you do that is taking the top few
>> levels of the index out of the database and putting it into the
>> filesystem; plus creating duplicative indexing information in the
>> database's system catalogs.
>>
>> The degree to which this is a win is *highly* debatable, and certainly
>> depends on a whole lot of assumptions about filesystem performance.
>> You also need to assume that constraint-exclusion in the planner is
>> pretty doggone cheap relative to the table searches, which means it
>> almost certainly will lose badly if you carry the subdivision out to
>> the extent that the individual tables become small.  (This last could
>                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> what is considered 'small'? a few thousand records, a few million
> records?

I would say small is when the individual tables are in the 10 to 20
Megabyte range.  How many records that is depends on record width, of
course.  Basically, the tables are small once you don't really need
indexes much anymore, since you tend to grab 25% or more of each one
that you're going to hit in a query.

> what multiplication factor would there need to be on the partitioning
> to make it worth while? 100 tables, 1000 tables, 10000 tables?
Really depends on the size of the master table I think.  If the master
table is about 500 Megs in size, and you partition it down to about 1
meg per child table, you're probably ok.  Walking through 500 entries
for constraint exclusion seems pretty speedy from the tests I've run on
a 12M row table that was about 250 Megs, split into 200 to 400 or so
equisized child tables.  The time to retrieve 85,000 rows that were all
neighbors went from 2 to 6 seconds, to about 0.2 seconds, and we got rid
of indexes entirely since they weren't really needed anymore.
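
A quick way to check that the exclusion is happening, sketched against the
sort of range layout described earlier:

    SET constraint_exclusion = on;   -- off by default in 8.1/8.2
    EXPLAIN SELECT count(*)
      FROM responses
     WHERE id_client BETWEEN 1000 AND 1099;
    -- the plan should list only the child tables whose CHECK constraints
    -- can contain that range, not every partition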

> the company that I'm at started out with a seperate database per
> customer (not useing postgres), there are basicly zero cross-customer
> queries, with a large volume of updates and lookups.
>
> overall things have now grown to millions of updates/day (some
> multiple of this in lookups), and ~2000 customers, with tens of
> millions of rows between them.
>
> having each one as a seperate database has really helped us over the
> years as it's made it easy to scale (run 500 databases on each server
> instead of 1000, performance just doubled)
I think that for what you're doing, partitioning at the database level
is probably a pretty good compromise solution.  Like you say, it's easy
to put busy databases on a new server to balance out the load.  Hardware
is cheap.

> various people (not database experts) are pushing to install Oracle
> cluster so that they can move all of these to one table with a
> customerID column.
Have these people identified a particular problem they're trying to
solve, or is this a religious issue for them?  From your description it
sounds like a matter of dogma, not problem solving.
> the database folks won't comment much on this either way, but they
> don't seem enthusiastic to combine all the data togeather.
I think they can see the fecal matter heading towards the rotational
cooling device on this one.  I can't imagine this being a win from the
perspective of saving the company money.


Re: Thousands of tables versus one table?

From
Scott Marlowe
Date:
Tom Lane wrote:
> The degree to which this is a win is *highly* debatable, and certainly
> depends on a whole lot of assumptions about filesystem performance.
> You also need to assume that constraint-exclusion in the planner is
> pretty doggone cheap relative to the table searches, which means it
> almost certainly will lose badly if you carry the subdivision out to
> the extent that the individual tables become small.  (This last could
> be improved in some cases if we had a more explicit representation of
> partitioning, but it'll never be as cheap as one more level of index
> search.)
I did some testing a while back on some of this, and with 400 or so
partitions, the select time was still very fast.

We were testing grabbing 50-80k rows from 12M at a time, all adjacent to
each other.  With the one big table and one big two way index method, we
were getting linearly increasing select times as the dataset grew larger
and larger.  The indexes were much larger than available memory and
shared buffers.  The retrieval time for 50-80k rows was on the order of
2 to 6 seconds, while the retrieval time for the same number of rows
with 400 partitions was about 0.2 to 0.5 seconds.

I haven't tested with more partitions than that, but might if I get a
chance.  What was really slow was the inserts since I was using rules at
the time.  I'd like to try re-writing it to use triggers, since I would
then have one trigger on the parent table instead of 400 rules.  Or I
could embed the rules into the app that was creating / inserting the
data.  The insert performance dropped off VERY fast as I went over 100
rules, and that was what primarily stopped me from testing larger
numbers of partitions.
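
The trigger alternative would look roughly like this, reusing the
hypothetical child tables sketched earlier; a real setup would generate one
branch per partition (or use EXECUTE):

    CREATE OR REPLACE FUNCTION responses_insert_router() RETURNS trigger AS $$
    BEGIN
        IF NEW.id_client < 1000 THEN
            INSERT INTO responses_c0000 VALUES (NEW.*);
        ELSIF NEW.id_client < 2000 THEN
            INSERT INTO responses_c1000 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for id_client %', NEW.id_client;
        END IF;
        RETURN NULL;  -- the row is already in a child, skip the parent table
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER responses_insert_trigger
        BEFORE INSERT ON responses
        FOR EACH ROW EXECUTE PROCEDURE responses_insert_router();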

The select performance stayed very fast with more partitions, so I'm
guessing that the constraint exclusion is pretty well optimized.

I'll play with it some more when I get a chance.  For certain operations
like the one we were testing, partitioning seems to pay off big time.

Re: Thousands of tables versus one table?

From
Craig James
Date:
david@lang.hm wrote:
> various people (not database experts) are pushing to install Oracle
> cluster so that they can move all of these to one table with a
> customerID column.

They're blowing smoke if they think Oracle can do this.  One of my
applications had this exact same problem -- table-per-customer versus
big-table-for-everyone.  Oracle fell over dead, even with the best indexing
possible, tuned by the experts, and using partitions keyed to the customerID.

We ended up breaking it up into table-per-customer because Oracle fell over
dead when we had to do a big update on a customer's entire dataset.  All
other operations were slowed by the additional index on the customer-ID,
especially complex joins.  With a table-for-everyone, you're forced to
create tricky partitioning or clustering, clever indexes, and even with
that, big updates are problematic.  And once you do this, then you become
heavily tied to one RDBMS and your applications are no longer portable,
because clustering, indexing, partitioning and other DB tuning tricks are
very specific to each RDBMS.

When we moved to Postgres, we never revisited this issue, because both
Oracle and Postgres are able to handle thousands of tables well.  As I
wrote in a previous message on a different topic, often the design of your
application is more important than the performance.  In our case, the
table-per-customer makes the applications simpler, and security is MUCH
easier.

Oracle is simply not better than Postgres in this regard.  As far as I
know, there is only one specific situation (discussed frequently here)
where Oracle is faster: the count(), min() and max() functions, and I know
significant progress has been made since I started using Postgres.  I have
not found any other query where Oracle is significantly better, and I've
found several where Postgres is the clear winner.

It's telling that Oracle's license contract prohibits you from publishing
comparisons and benchmarks.  You have to wonder why.

Craig

Re: Thousands of tables versus one table?

From
Scott Marlowe
Date:
Craig James wrote:
>
> Oracle is simply not better than Postgres in this regard.  As far as I
> know, there is only one specific situation (discussed frequently here)
> where Oracle is faster: the count(), min() and max() functions, and I
> know significant progress has been made since I started using
> Postgres.  I have not found any other query where Oracle is
> significantly better, and I've found several where Postgres is the
> clear winner.
In my testing between a commercial database that cannot be named and
postgresql, I found max() / min() to be basically the same, even with
where clauses and joins happening.

count(*), OTOH, is still a clear winner for the big commercial
database.  With smaller sets (1 Million or so) both dbs are in the same
ballpark.

With 30+million rows, count(*) took 2 minutes on pgsql and 4 seconds on
the big database.

OTOH, there are some things, like importing data, which are MUCH faster
in pgsql than in the big database.

Re: Thousands of tables versus one table?

From
Craig James
Date:
Scott Marlowe wrote:
> OTOH, there are some things, like importing data, which are MUCH faster
> in pgsql than in the big database.

An excellent point, I forgot about this.  The COPY command is the best
thing since the invention of a shirt pocket.  We have a
database-per-customer design, and one of the monstrous advantages of
Postgres is that we can easily do backups.  A pg_dump, then scp to a backup
server, and in just a minute or two we have a full backup.  For recovery,
pg_restore is equally fast and amazing.  Last time I checked, Oracle didn't
have anything close to this.

Craig



Re: Thousands of tables versus one table?

From
"Jonah H. Harris"
Date:
On 6/6/07, Craig James <craig_james@emolecules.com> wrote:
> They're blowing smoke if they think Oracle can do this.

Oracle could handle this fine.

> Oracle fell over dead, even with the best indexing possible,
> tuned by the experts, and using partitions keyed to the
> customerID.

I don't think so; whoever tuned this likely didn't know what they were doing.

> It's telling that Oracle's license contract prohibits you from
> publishing comparisons and benchmarks.  You have to wonder why.

They did this for the same reason as everyone else.  They don't want
non-experts tuning the database incorrectly, writing a benchmark paper
about it, and making the software look bad.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

Re: Thousands of tables versus one table?

From
"Jonah H. Harris"
Date:
On 6/6/07, Craig James <craig_james@emolecules.com> wrote:
> Last time I checked, Oracle didn't have anything close to this.

When did you check, 15 years ago?  Oracle has direct-path
import/export and Data Pump, both of which make generic COPY look like
a turtle.  The new PostgreSQL bulk loader takes similar concepts from
Oracle and is considerably faster than COPY.

Don't get me wrong, I'm pro-PostgreSQL... but spouting personal
observations on other databases as facts just bolsters a
PostgreSQL-centric, egotistical view of the world.  If you don't tune
Oracle, it will suck.  If you don't understand Oracle architecture
when you tune an application, it will suck; just like PostgreSQL.
People who don't have extensive experience in the other databases just
hear what you say and regurgitate it as fact, which it is not.

Look at how many people in these lists still go on and on about MySQL
flaws based on their experience with MySQL 3.23.  Times change and it
doesn't do anyone any good to be ignorant of other databases.  If
you're going to speak about another database in a comparison, please
stay current or specify the database you're comparing against.

This is nothing against you, but it always starts an avalanche of,
"look how perfect we are compared to everyone else."

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

Re: Thousands of tables versus one table?

From
Andrew Sullivan
Date:
On Wed, Jun 06, 2007 at 12:06:09AM +0200, Steinar H. Gunderson wrote:

> Wasn't there also talk about adding the ability to mark individual
> partitions as read-only, thus bypassing MVCC and allowing queries
> to be satisfied using indexes only?

I have a (different) problem that read-only data segments (maybe
partitions, maybe something else) would help with, so I know for sure that
someone is working on a problem like this, but I don't think it's the
sort of thing that's going to come any time soon.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what
you told them to.  That actually seems sort of quaint now.
        --J.D. Baldwin

Re: Thousands of tables versus one table?

From
Andrew Sullivan
Date:
On Tue, Jun 05, 2007 at 03:31:55PM -0700, david@lang.hm wrote:
> various people (not database experts) are pushing to install Oracle
> cluster so that they can move all of these to one table with a customerID
> column.

Well, you will always have to deal with the sort of people who will
base their technical prescriptions on the shiny ads they read in
SuperGlobalNetworkedExecutiveGoFast, or whatever rag they're reading
these days.  I usually encourage such people actually to perform the
analysis of the license, salary, contingency, and migration costs
(and do a similar analysis myself, actually, so when they have
overlooked the 30 things that individually cost $1 million apiece, I
can point them out).  More than one jaw has had to be picked up off
the floor when presented with the bill for RAC.  Frequently, people
discover that it is a good way to turn your tidy money-making
enterprise into a giant money hole that produces a sucking sound, at
the other end of which is Oracle Corporation.

All of that aside, I have pretty severe doubts that RAC would be a
win for you.  A big honkin' single database in Postgres ought to be
able to do this too, if you throw enough hardware money at it.  But
it seems a waste to re-implement something that's already apparently
working for you in favour of something more expensive that you don't
seem to need.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
When my information changes, I alter my conclusions.  What do you do sir?
        --attr. John Maynard Keynes

Re: Thousands of tables versus on table?

From
"Jonah H. Harris"
Date:
On 6/6/07, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> Well, you will always have to deal with the sort of people who will
> base their technical prescriptions on the shiny ads they read in
> SuperGlobalNetworkedExecutiveGoFast, or whatever rag they're reading
> these days.

Always.

> I usually encourage such people actually to perform the
> analysis of the license, salary, contingency, and migrations costs

Yes, this is the best way.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

control of benchmarks (was: Thousands of tables)

From
Andrew Sullivan
Date:
On Wed, Jun 06, 2007 at 02:01:59PM -0400, Jonah H. Harris wrote:
> They did this for the same reason as everyone else.  They don't want
> non-experts tuning the database incorrectly, writing a benchmark paper
> about it, and making the software look bad.

I agree that Oracle is a fine system, and I have my doubts about the
likelihood Oracle will fall over under fairly heavy loads.  But I
think the above is giving Oracle Corp a little too much credit.

Corporations exist to make money, and the reason they prohibit doing
anything with their software and then publishing it without their
approval is that they want to control all public perception of
their software, whether deserved or not.  Every user of any large
software system (Oracle or otherwise) has their favourite horror
story about the grotty corners of that software;
commercially-licensed people just aren't allowed to prove it in
public.  It's not only the clueless that Oracle is protecting itself
against; it's also the smart, accurate, but expensive corner-case
testers.  I get to complain that PostgreSQL is mostly fast but has
terrible outlier performance problems.  I can think of another system
that I've used that certainly had a similar issue, but I couldn't
show you the data to prove it.  Everyone who used it knew about it,
though.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
A certain description of men are for getting out of debt, yet are
against all taxes for raising money to pay it off.
        --Alexander Hamilton

Re: control of benchmarks (was: Thousands of tables)

From
"Jonah H. Harris"
Date:
On 6/6/07, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> But I think the above is giving Oracle Corp a little too
> much credit.

Perhaps.  However, Oracle has a thousand or so knobs which can control
almost every aspect of every subsystem.  If you know how they interact
with each other and how to use them properly, they can make a huge
difference in performance.  Most people do not know all the knobs or
understand what difference each can make given the theory and
architecture of the system, which results in poor general
configurations.  Arguably, there is a cost associated with staffing
or consulting someone with the depth of knowledge required to tune it
in such a manner, which goes back to a basic cost/benefit analysis.

Oracle, while seeming like a one-size-fits-all system, has the same
basic issue as PostgreSQL and everyone else: to get optimum
performance, it has to be tuned specifically for the
application/workload at hand.

> Corporations exist to make money, and the reason they prohibit doing
> anything with their software and then publishing it without their
> approval is because they want to control all the public perception of
> their software, whether deserved or not.

Of course.  Which is why audited benchmarks like SPEC and TPC are
around.  While they may not represent one's particular workload, they
are the only way to fairly demonstrate comparable performance.

> Every user of any large software system (Oracle or otherwise)
> has their favourite horror story about the grotty corners of
> that software;

Of course, but they also never say what caused it.  With Oracle,
almost all of the bad-performance cases I've seen were related to
improper tuning and/or hardware, even with experienced DBAs involved.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

Re: Thousands of tables versus on table?

From
Craig James
Date:
Jonah H. Harris wrote:
> On 6/6/07, Craig James <craig_james@emolecules.com> wrote:
>> They're blowing smoke if they think Oracle can do this.
>
> Oracle could handle this fine.
>
>> Oracle fell over dead, even with the best indexing possible,
>> tuned by the experts, and using partitions keyed to the
>> customerID.
>
> I don't think so; whoever tuned this likely didn't know what they were
> doing.

Wrong on both counts.

You didn't read my message.  I said that *BOTH* Oracle and Postgres performed
well with table-per-customer.  I wasn't Oracle-bashing.  In fact, I was doing
the opposite: Someone's coworker claimed ORACLE was the miracle cure for all
problems, and I was simply pointing out that there are no miracle cures.  (I
prefer Postgres for many reasons, but Oracle is a fine RDBMS that I have used
extensively.)

The technical question is simple: Table-per-customer or big-table-for-everyone.
The answer is, "it depends."  It depends on your application, your
read-versus-write ratio, the table size, the design of your application
software, and a dozen other factors.  There is no simple answer, but there are
important technical insights which, I'm happy to report, various people
contributed to this discussion.  Perhaps you have some technical insight too,
because it really is an important question.

The reason I assert (and stand by this) that "They're blowing smoke" when they
claim Oracle has the magic cure is because Oracle and Postgres are both
relational databases, they write their data to disks, and they both have
indexes with O(log(N)) retrieval/update times.  Oracle doesn't have a magical
workaround to these facts, nor does Postgres.
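
As a rough back-of-the-envelope on that O(log(N)) point (the fanout and row
counts below are illustrative assumptions, not figures from the thread): a
B-tree with fanout f holding N keys has a depth of roughly

    \[ d \approx \lceil \log_f N \rceil \]

so with a fanout of around 250 entries per 8 KB page, a 10,000-row per-customer
table is about two levels deep and a 25-million-row all-customers table about
four.  Splitting the data per customer saves a page fetch or two per index
descent, not an order of magnitude, no matter whose B-tree implementation is
underneath.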

Craig

Re: Thousands of tables versus on table?

From
"Jonah H. Harris"
Date:
On 6/6/07, Craig James <craig_james@emolecules.com> wrote:
> You didn't read my message.  I said that *BOTH* Oracle
> and Postgres performed well with table-per-customer.

Yes, I did.  My belief is that Oracle can handle all customers in a
single table.

> The technical question is simple: Table-per-customer or
> big-table-for-everyone.  The answer is, "it depends."

I agree, it does depend on the data, workload, etc.  No
one-size-fits-all answer there.

> The reason I assert (and stand by this) that "They're
> blowing smoke" when they claim Oracle has the magic
> cure, is because Oracle and Postgres are both relational
> databases, they write their data to disks, and they both
> have indexes with O(log(N)) retrieval/update times.  Oracle
> doesn't have a magical workaround to these facts,
> nor does Postgres.

Agreed that they are similar on the basics, but they do use
significantly different algorithms and optimizations.  Likewise, there
is more tuning that can be done with Oracle given the amount of time
and money one has to spend on it.  Again, it comes down to a cost/benefit
analysis for this type of issue... but you're right, there is no "magic cure".

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

Re: Thousands of tables versus on table?

From
david@lang.hm
Date:
On Wed, 6 Jun 2007, Scott Marlowe wrote:

>> >  pretty doggone cheap relative to the table searches, which means it
>> >  almost certainly will lose badly if you carry the subdivision out to
>> >  the extent that the individual tables become small.  (This last could
>>                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>  what is considered 'small'? a few thousand records, a few million records?
>
> I would say small is when the individual tables are in the 10 to 20 Megabyte
> range.  How many records that is depends on record width, of course.
> Basically, it's once the tables get small enough that you don't really need
> indexes much, since you tend to grab 25% or more of each one that you're
> going to hit in a query.

thanks, that makes a lot of sense

>>  what multiplication factor would there need to be on the partitioning to
>>  make it worthwhile? 100 tables, 1000 tables, 10000 tables?
> Really depends on the size of the master table I think.  If the master table
> is about 500 Megs in size, and you partition it down to about 1 meg per child
> table, you're probably ok.  Walking through 500 entries for constraint
> exclusion seems pretty speedy from the tests I've run on a 12M row table that
> was about 250 Megs, split into 200 to 400 or so equisized child tables.  The
> time to retrieve 85,000 rows that were all neighbors went from 2-6 seconds
> to about 0.2 seconds, and we got rid of indexes entirely since they
> weren't really needed anymore.

remember, I'm talking about a case where you don't have to go through
constraint checking: you know to start with what customerID you are dealing
with, so you just check the tables for that customer.
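
A minimal sketch of the per-customer split being discussed (not from the
thread; table, column, and customer names are invented, loosely following the
survey schema from earlier), using the inheritance-based partitioning
PostgreSQL offered at the time.  The first query takes the constraint-exclusion
route Scott measured; the second takes the route described above, where the
application already knows the customer and goes straight to that customer's
table:

    CREATE TABLE responses (
        id_client    integer NOT NULL,
        id_responder integer NOT NULL,
        id_survey    integer NOT NULL,
        answer       text
    );

    -- one child table per customer, with a CHECK constraint the planner can
    -- use to exclude it from queries for other customers
    CREATE TABLE responses_client_42 (
        CHECK (id_client = 42)
    ) INHERITS (responses);

    -- constraint-exclusion route: the planner walks all children and skips
    -- those whose CHECK constraint rules them out
    SET constraint_exclusion = on;
    SELECT *
      FROM responses
     WHERE id_client = 42 AND id_responder = 1001 AND id_survey = 7;

    -- per-customer route: name the child table directly, so no constraint
    -- checking is needed at all
    SELECT *
      FROM responses_client_42
     WHERE id_responder = 1001 AND id_survey = 7;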

>>  the company that I'm at started out with a separate database per customer
>>  (not using postgres), there are basically zero cross-customer queries, with
>>  a large volume of updates and lookups.
>>
>>  overall things have now grown to millions of updates/day (some multiple of
>>  this in lookups), and ~2000 customers, with tens of millions of rows
>>  between them.
>>
>>  having each one as a separate database has really helped us over the years
>>  as it's made it easy to scale (run 500 databases on each server instead of
>>  1000, performance just doubled)
> I think that for what you're doing, partitioning at the database level is
> probably a pretty good compromise solution.  Like you say, it's easy to put
> busy databases on a new server to balance out the load.  Hardware is cheap.
>
>>  various people (not database experts) are pushing to install Oracle
>>  cluster so that they can move all of these to one table with a customerID
>>  column.
> Have these people identified a particular problem they're trying to solve, or
> is this a religious issue for them?  From your description it sounds like a
> matter of dogma, not problem solving.

in part it is; in part it's because the commercial database companies have
told management that doing database replication is impossible with so many
databases (we first heard this back when we had 300 or so databases).
we've gone the expensive EMC disk-layer replication route, but they think
that merging everything will somehow simplify things so the database can
do its job better.

I see it as just a limitation of the replication solutions offered by the
big-name vendors.

>>  the database folks won't comment much on this either way, but they don't
>>  seem enthusiastic about combining all the data together.
> I think they can see the fecal matter heading towards the rotational cooling
> device on this one.  I can't imagine this being a win from the perspective of
> saving the company money.

neither can I.

David Lang