Thread: One large v. many small

From: Noah Silverman
As we continue our evaluation of Postgres, another interesting topic
has come up that I want to run by the group.

In our current model, we have about 3,000 small tables that we use to
track data for our clients.  Each table has an identical structure and
holds the data for one client.

Another idea that we are considering is one big table instead of 3,000
smaller ones.  We could simply add a numeric field to indicate which
client a particular record was for.

Each table has between 500 and 50,000 records, so the big table could
have up to 10 million rows if we combined everything.


A query on our current system is (for client #4)

Select * from client_4 where foo=2;

A query from the new, proposed system would be

Select * from big_results where client=4 and foo=2;

The big question is: WHICH WILL BE FASTER with Postgres?  Is there any
performance improvement or cost to switching to this new structure?


ANY AND ALL FEEDBACK/OPINIONS ARE WELCOME!!

Thanks,

Noah


Re: One large v. many small

From: Josh Berkus
Noah,

> As we continue our evaluation of Postgres, another interesting topic
> has come up that I want to run by the group.
>
> In our current model, we have about 3,000 small tables that we use to
> track data for our clients.  Each table has an identical structure and
> holds the data for one client.

I'd list what's wrong with this structure, but frankly it would take me long
enough that I'd need a consulting fee.   Suffice it to say that the above is
a very, very bad (or at least antiquated) design idea and you need to
transition out of it as soon as possible.

> Another idea that we are considering is one big table instead of 3,000
> smaller ones.  We could simply add a numeric field to indicate which
> client a particular record was for.

Yes.   Absolutely.  Although I'd suggest an Integer field.
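
For illustration, a minimal sketch of the combined structure (client and
foo are the columns from the thread's examples; everything else, including
the names, is a placeholder):

CREATE TABLE big_results (
    client  integer NOT NULL,   -- which client the record belongs to
    foo     integer,            -- the column queried in the examples
    payload text                -- stand-in for the rest of the shared columns
);

CREATE INDEX big_results_client_foo ON big_results (client, foo);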

> Each table has between 500 and 50,000 records, so the big table could
> have up to 10 million rows if we combined everything.

Sure.

> A query on our current system is (for client #4)
>
> Select * from client_4 where foo=2;
>
> A query from the new, proposed system would be
>
> Select * from big_results where client=4 and foo=2;
>
> The big question is: WHICH WILL BE FASTER with Postgres?  Is there any
> performance improvement or cost to switching to this new structure?

Oh, no question query 1 will be faster ... FOR THAT QUERY.  You are asking the
wrong question.

However, explain to me how, under the current system, you can find the client
who ordered $3000 worth of widgets on January 12th if you don't already know
who it is?   I'm not sure a 3000-table UNION query is even *possible*.

Or how about giving me the average number of customer transactions in a month,
across all clients?
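
With a single table, questions like those become one statement apiece.  A
hypothetical sketch of the second one (assuming a timestamp column such as
trans_date, which the thread doesn't specify):

SELECT client, count(*) AS transactions
FROM big_results
WHERE trans_date >= '2003-01-01'
  AND trans_date <  '2003-02-01'
GROUP BY client;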

<rant>

You've enslaved your application design to performance considerations ... an
approach which was valid in 1990, because processing power was so limited
then.  But now that dual-processor servers with RAID can be had for less than
$3000, there's simply no excuse for violating the principles of good
relational database design just to speed up a query.   Buying more RAM is
much cheaper than having an engineer spend 3 weeks fixing data integrity
problems.

The proper way to go about application design is to build your application on
paper or in a modelling program according to the best principles of software
design available, and *then* to discuss performance issues -- addressing them
*first* by buying hardware, and only compromising your application design when
no other alternative is available.

</rant>

I strongly suggest that you purchase Pascal's "Practical Issues in Database
Design" and give it a read.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: One large v. many small

From: Andrew Sullivan
On Thu, Jan 30, 2003 at 12:34:36PM -0500, Noah Silverman wrote:
> Select * from client_4 where foo=2;
>
> A query from the new, proposed system would be
>
> Select * from big_results where client=4 and foo=2;
>
> The big question is: WHICH WILL BE FASTER with Postgres?  Is there any
> performance improvement or cost to switching to this new structure?

Faster overall, or faster for that operation?  I can't prove it, but
I suspect that the first one will return faster just because both the
index and the table itself are smaller.

It's possible that it will cause you problems overall, however,
because of the large number of files you have to keep if you use 3,000
tables.  This depends on your filesystem (and its implementation).

Note, too, that a lot of transactions frequently updating the table
might make a difference.  A large number of dead tuples sitting on a
10 million row table will make anything crawl.
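
Routine vacuuming mitigates that.  A minimal sketch, assuming the combined
table is named big_results as elsewhere in the thread:

-- Reclaim dead tuples and refresh planner statistics; schedule regularly.
VACUUM ANALYZE big_results;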

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: One large v. many small

From: Noah Silverman
OK,

Thanks for the quick responses.

A bit more information.

We are in the business of gathering data for our clients (we're a news
service).  Consequently, we do a lot of inserting and very rarely do
any deleting.  (We periodically clear out results that are over 6 months
old.)

On a given day, we will insert around 100,000 records in total.
(Currently split across all the client tables).

A challenging part of the process is that we have to keep track of
previous content that may be similar.  We CAN'T do this with a unique
index (don't ask, it would take too long to explain, but trust me, it
isn't possible).  So, we have to query the table first and then compare
the results of that query to what we are inserting.  So, we probably do
close to 1 million queries, but then only make about 100,000 inserts.
The basic flow is 1) our system finds something it likes, 2) query the
table to see if something similar already exists, 3) if nothing similar
exists, insert.
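
A hypothetical sketch of that flow against the proposed single table (the
topic and headline columns are invented for illustration):

-- Step 2: fetch potentially similar rows; the application compares them.
SELECT headline FROM big_results
WHERE client = 4 AND topic = 'widgets';

-- Step 3: only if the application finds nothing similar:
INSERT INTO big_results (client, topic, headline)
VALUES (4, 'widgets', 'New widget story');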

While all this is going on, our clients are accessing our online
reporting system.  This system makes a variety of count and record
requests from the database.

As I mentioned in our earlier post, we are attempting to decide if
Postgres will run faster/better with one big table, or a bunch of
smaller ones.  It really doesn't make much difference to us, we just
want whatever structure will be faster.

Thanks,

-N


Re: One large v. many small

From: Jeff
On Thu, 30 Jan 2003, Josh Berkus wrote:

>
> The proper way to go about application design is to build your application on
> paper or in a modelling program according to the best principles of software
> design available, and *then* to discuss performance issues -- addressing them
> *first* by buying hardware, and only compromising your application design when
> no other alternative is available.
>

App design & performance go hand-in-hand; the trick is to balance them.
Who wants a _wonderful_ design that runs like a piece of poo?  In this
case I agree with you - not the best design around.  Buying hardware to
fix speed problems is useful, but the software side should not be
neglected.  Imagine this scenario using your method, with a wonderful PG
performance problem in hand (unless you are running CVS):

User has a schema and writes a query along the lines of

select somevalue from sometable where othervalue not in (select badvalues
from badvaluetable where id = 12345)

We all know this runs horrifically on Postgres.  Using your method, I
should go out and spend thousands on multi-processor boxes, RAID, and RAM.

If you do a little app tuning instead (maybe spend 10-30 minutes reading
the pgsql archives), you'll learn to rewrite it as an EXISTS query and
make it faster than it ever could have been on the fast hardware.  I just
saved the company $10k too!  (That depends on whether you consider the
change a design change.)  Some designs are fatally flawed from the start,
but hey.. oh well.
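
For illustration, the EXISTS rewrite described above might look like this
(using the hypothetical table and column names from the example; note the
two forms only behave identically when badvalues contains no NULLs):

select somevalue from sometable t
where not exists (select 1 from badvaluetable b
                  where b.id = 12345
                    and b.badvalues = t.othervalue);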

'Tis a fine line though.. balancing hardware vs. software optimization.
(I'm also guessing they are not constrained by something like an embedded
system.)


------------------------------------------------------------------------------
Jeff Trout <jeff@jefftrout.com>                  http://www.jefftrout.com/
   Ronald McDonald, with the help of cheese soup,
       controls America from a secret volkswagon hidden in the past
-------------------------------------------------------------------------------



Re: One large v. many small

From: Robert Treat
I'm going to go against the grain here and say that if you already have
all of the code and schema worked out, you probably should stick with
the many-table design.  While there are many reasons you'd be better off
with the one-big-table design, a speed increase really isn't one of
them.  If you were starting from scratch, or even had a slew of
development work you were planning to do, I'd probably recommend the
one-big-table approach; but if you don't have any bottlenecks in your
current system and the type of query you've given is typical of the
majority of what your application is doing, there's no sense redesigning
your application in the middle of a database switch.

Robert Treat

PS. Josh, are you referring to Pascal's "Practical Issues In Database
Management" book or does he have a different book out that I'm not aware
of?




Re: One large v. many small

From: Chad Thompson
I have a database running on PostgreSQL with close to 7 million records in
one table and ~3 million in another, along with various smaller supporting
tables.  Before I started here, everything was run out of small tables, one
for each client, similar (I think) to what you are doing now.  We submit
~50-100K records each week, and before we moved to one table, our company
had no idea how it was doing on a daily, weekly, or monthly basis.  Now that
we have moved to one large structure, new ideas about reporting functions
and added services we can give to our clients are popping up all the time.

There are MANY benefits to following Josh's advice and putting all your
information in one table.  Beyond those already given: what if you wanted
to offer your clients an added service where they are made aware of similar
postings by your other clients?  Running that kind of report would be a
nightmare in your current situation.

As far as performance goes, I am able to join these 2 tables along with
others and get the information, counts, etc. that I need, using some rather
complicated queries, in about 2-3 seconds per query.  While that may sound
awful, realize that I'm running on a standard workstation (PIII 700), and
for the money, it's a dream!

More importantly, you need to realize, as my coworkers have now done, that
anything you can do with a small table, you can do with one big table and
an extra line in the where clause (e.g. WHERE client_id = 'blah').
PostgreSQL has wonderful support and many excellent DBAs; if you post a
SQL problem, they are very supportive in helping solve it.

I hope this helps make your decision.
Thanks
Chad




Re: One large v. many small

From: Curt Sampson
On Thu, 30 Jan 2003, Josh Berkus wrote:

> > Another idea that we are considering is one big table instead of 3,000
> > smaller ones.  We could simply add a numeric field to indicate which
> > client a particular record was for.
>
> Yes.   Absolutely.  Although I'd suggest an Integer field.

From the description given in Noah's message, and also the one given in
his later message, I have little doubt that 3000 small tables are going
to be significantly faster than one large table. If you don't believe
me, work out just where the disk blocks are going to end up, and how
many blocks are going to have to be fetched for his typical query in
a semi-clustered or non-clustered table. (If postgres had clustered
indexes a la MS SQL server, where the rows are physically stored in the
order of the clustered index, it would be a different matter.)

> However, explain to me how, under the current system, you can find the client
> who ordered $3000 worth of widgets on January 12th if you don't already know
> who it is?

Explain to me why he has to do this.

It's all very nice to have a general system that can do well on all
sorts of queries, but if you lose time on the queries you do do, in
order to save time on queries you don't do, you're definitely not
getting the best performance out of the system.

> I'm not sure a 3000-table UNION query is even *possible*.

This is not the only solution, either. You could simply just do 3000
queries. If this is something you execute only once a month, then making
that query three or four orders of magnitude more expensive might be a
small price to pay for making cheaper the queries you run several times
per second.

> <rant>
>
> You've enslaved your application design to performance considerations ... an
> approach which was valid in 1990, because processing power was so limited
> then.  But now that dual-processor servers with RAID can be had for less than
> $3000, there's simply no excuse for violating the principles of good
> relational database design just to speed up a query.  Buying more RAM is
> much cheaper than having an engineer spend 3 weeks fixing data integrity
> problems.

*Sigh.* Ok, my turn to rant.

RAM is not cheap enough yet for me to buy several hundred gigabytes of
it for typical applications, even if I could find a server that I could
put it in. Disk performance is not growing the way CPU performance is.
And three weeks of engineering time plus a ten thousand dollar server
is, even at my top billing rate, still a heck of a lot cheaper than a
quarter-million dollar server.

Applying your strategy to all situations is not always going to produce
the most cost-effective solution. And for most businesses, that's what it's
all about. They're not interested in the more "theoretically pure" way of
doing things except insofar as it makes them money.

As for the data integrity problems, I don't know where that came from. I
think that was made up out of whole cloth, because it didn't seem to me
that the original question involved any.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

Re: One large v. many small

From: Josh Berkus
Noah,

Well, there you have it:  a unanimous consensus of opinion.   You
should either combine all of your tables or not.  But definitely one or
the other.

<grin>

Hope you feel informed now.

-Josh

Re: One large v. many small

From: Shridhar Daithankar
On Thursday 30 Jan 2003 11:54 pm, you wrote:
> As I mentioned in our earlier post, we are attempting to decide if
> Postgres will run faster/better/ with one big table, or a bunch of
> smaller ones.  It really doesn't make much difference to us, we just
> want whatever structure will be faster.

I would say create a big table with a client id.  Create an index on it and
create 3,000 views.  Of course, you need to figure out the SQL voodoo to
insert into PostgreSQL views using rules.

That would save you from modifying your app up and down.  But there is
going to be massive fragmentation, so consider clustering the table once in
a while.
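
A minimal sketch of that arrangement, with illustrative names (only the
client and foo columns from the thread are shown; a real table would carry
the rest):

-- One view per client, e.g. for client #4:
CREATE VIEW client_4 AS
    SELECT * FROM big_results WHERE client = 4;

-- A rule so existing INSERTs against client_4 keep working:
CREATE RULE client_4_insert AS
    ON INSERT TO client_4 DO INSTEAD
    INSERT INTO big_results (client, foo)
    VALUES (4, NEW.foo);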

 HTH

 Shridhar

Re: One large v. many small

From: Josh Berkus
Folks,

Many, many replies on this topic:

Jeff:
> App design & performance go hand-in-hand; the trick is to balance them.
> Who wants a _wonderful_ design that runs like a piece of poo?  In this
<snip>
>Select somevalue from sometable where othervalue not in (select badvalues
> from badvaluetable where id = 12345)
> We all know this runs horrifically on Postgres.  Using your method, I
> should go out and spend thousands on multi-processor boxes, RAID, and RAM.

Sorry, no, Jeff.   The above is what one calls a "bad query" and is not,
therefore, a performance vs. design issue: that query is bad design-wise, and
bad performance-wise.  Perhaps another example of your argument?

Since you do not seem to have understood my argument, it is this:
Design changes, made for the sake of performance or rapid app building, which
completely violate good RDBMS design and normalization principles, almost
always cost you more over the life of the application than you gain in
performance in the short term.

Curt:
> It's all very nice to have a general system that can do well on all
> sorts of queries, but if you lose time on the queries you do do, in
> order to save time on queries you don't do, you're definitely not
> getting the best performance out of the system.

This is a good point; I tend to build for posterity because, so far, 90% of my
clients who started out having me build a "single-purpose" database ended up
expanding the application to cover 2-10 additional needs, thus forcing me to
clean up any design shortcuts I took with the original app.    However, Noah
may have more control over his company than that.

<and>
> RAM is not cheap enough yet for me to buy several hundred gigabytes of
> it for typical applications, even if I could find a server that I could
> put it in. Disk performance is not growing the way CPU performance is.
> And three weeks of engineering time plus a ten thousand dollar server
> is, even at my top billing rate, still a heck of a lot cheaper than a
> quarter-million dollar server.

I was thinking more of the difference between a $3000 server and a $9000
server myself; unless you're doing nuclear test modelling, I don't see any
need for a $250,000 server for anything.
To give an extreme example, I have a client who purchased a $150,000
accounting system that turned out to be badly designed, normalization-wise,
partly because the accounting system engineers were focusing on 8-year-old
technology with performance restrictions which were no longer really
applicable (for example, they talked the client into buying a quad-processor
server and then wrote all of their middleware code on a platform that does
not do SMP).  Over the last two years, they have paid my company $175,000 to
"fix" this accounting database ... more, in fact, than I would have charged
them to write a better system from scratch.

<and>
> Applying your strategy to all situations is not always going to produce
> the most cost-effective solution.

That's very true.   In fact, that could be taken as a "general truism" ... no
one strategy applies to *all* situations.

> PS. Josh, are you referring to Pascal's "Practical Issues In Database
> Management" book or does he have a different book out that I'm not aware
> of?

Yes, you're correct.  Sorry!

--
Josh Berkus
Aglio Database Solutions
San Francisco


Re: One large v. many small

From: Curtis Faith
Curt Sampson wrote:
> From the description given in Noah's message, and also the
> one given in his later message, I have little doubt that 3000
> small tables are going to be significantly faster than one
> large table. If you don't believe me, work out just where the
> disk blocks are going to end up, and how many blocks are going
> to have to be fetched for his typical query in a semi-clustered or
> non-clustered table.

You may be right, Curt, but I've seen unintuitive results for this
kind of thing in the past.

Depending on the way the records are accessed and the cache size,
the exact opposite could be true. The index pages will most likely
rarely be in memory when you have 3000 different tables. Meaning
that each search will require at least three or four index page
retrievals plus the tuple page.

So what you might lose due to lack of clustering will be made up
by the more efficient caching of the upper levels of the index
btree pages.

Combine a multi-part index (on both client and foo, which order
would depend on the access required) that is clustered once a week
or so using the admittedly non-optimal PostgreSQL CLUSTER command,
and I'll bet you can get equivalent or better performance with the
single table, with the concomitant advantage of much better
reporting options.
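
For illustration, the weekly maintenance might look like this (table and
index names are the hypothetical ones used earlier; the CLUSTER syntax
shown is the 7.x form):

CREATE INDEX big_results_client_foo ON big_results (client, foo);

-- Re-run periodically to restore the physical ordering:
CLUSTER big_results_client_foo ON big_results;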

I've also seen many examples of linear algorithms in database
data dictionaries which would cause a 3000+ table database
to perform poorly during the parsing/query optimization stage.
I don't have any idea whether or not PostgreSQL suffers from this
problem.

I don't think there is any substitute for just trying it out. It
shouldn't be that hard to create a bunch of SQL statements that
concatenate the tables into one large one.
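
A hypothetical way to generate those statements from within Postgres itself
(this assumes the per-client tables are literally named client_1 through
client_3000 and share the big table's column order after the client column):

-- Emit one INSERT per client table; run the generated statements afterwards.
SELECT 'INSERT INTO big_results SELECT '
       || substring(relname::text from 8)   -- numeric suffix becomes the client id
       || ', t.* FROM ' || relname::text || ' t;'
FROM pg_class
WHERE relname LIKE 'client!_%' ESCAPE '!'
  AND relkind = 'r';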

Try the most common queries against both scenarios. You might be
surprised.

- Curtis



Re: One large v. many small

From: Curt Sampson
On Fri, 31 Jan 2003, Curtis Faith wrote:

> Depending on the way the records are accessed and the cache size,
> the exact opposite could be true. The index pages will most likely
> rarely be in memory when you have 3000 different tables. Meaning
> that each search will require at least three or four index page
> retrievals plus the tuple page.

Assuming you're using indexes at all. If you're tending to use table
scans, this doesn't matter.

From Noah's description it seemed he was--he said that a particular data
item couldn't be the primary key, presumably because he couldn't index
it reasonably. But this is just my guess, not a fact.

> Combine a multi-part index (on both client and foo, which order
> would depend on the access required) that is clustered once a week
> or so using the admittedly non-optimal PostgreSQL CLUSTER command
> and I'll bet you can get equivalent or better performance...

I would say that, just after a CLUSTER, you're likely to see better
performance because this would have the effect, on a FFS or similar
filesystem where you've got plenty of free space, of physically
clustering data that would not have been clustered in the case of a lot
of small tables that see a lot of appending evenly over all of them over
the course of time.

So the tradeoff there is really, can you afford the time for the CLUSTER?
(In a system where you have a lot of maintenance time, probably. Though if
it's a huge table, this might need an entire weekend. In a system that needs
to be up 24/7, probably not, unless you have lots of spare I/O capacity.)
Just out of curiosity, how does CLUSTER deal with updates to the table while
the CLUSTER command is running?

> I don't think there is any substitute for just trying it out. It
> shouldn't be that hard to create a bunch of SQL statements that
> concatenate the tables into one large one.

I entirely agree! There are too many unknowns here to do more than
speculate on this list.

But thanks for enlightening me on situations where one big table might
perform better.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

Re: One large v. many small

From: Bruce Momjian
Curt Sampson wrote:
> So the tradeoff there is really, can you afford the time for the CLUSTER?
> (In a system where you have a lot of maintenance time, probably. Though if
> it's a huge table, this might need an entire weekend. In a system that needs
> to be up 24/7, probably not, unless you have lots of spare I/O capacity.)
> Just out of curiosity, how does CLUSTER deal with updates to the table while
> the CLUSTER command is running?

CLUSTER locks the table, so no updates can happen during a cluster.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073