Thread: One large v. many small
As we continue our evaluation of Postgres, another interesting topic has come up that I want to run by the group.

In our current model, we have about 3,000 small tables that we use to track data for our clients. Each table has an identical structure, and holds the data for one client.

Another idea that we are considering is one big table instead of 3,000 smaller ones. We could simply add a numeric field to indicate which client a particular record was for. Each table has between 500 and 50,000 records, so the big table could have up to 10 million rows if we combined everything.

A query on our current system is (for client #4):

    Select * from client_4 where foo=2;

A query from the new, proposed system would be:

    Select * from big_results where client=4 and foo=2;

The big question is, WHICH WILL BE FASTER with Postgres? Is there any performance improvement or cost to switching to this new structure?

ANY AND ALL FEEDBACK/OPINIONS ARE WELCOME!!

Thanks,

Noah
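A rough sketch of the consolidated layout under discussion; apart from "client" and "foo", which come from the example queries above, the names are invented for illustration:

    CREATE TABLE big_results (
        client  integer NOT NULL,  -- identifies which client a record belongs to
        foo     integer,           -- stands in for the existing per-client columns
        payload text
    );

    -- composite index so "where client=4 and foo=2" can be answered by an index scan
    CREATE INDEX big_results_client_foo_idx ON big_results (client, foo);

With an index like that in place, the per-client query against the big table stays index-assisted rather than scanning all 10 million rows.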
Noah,

> As we continue our evaluation of Postgres, another interesting topic
> has come up that I want to run by the group.
>
> In our current model, we have about 3,000 small tables that we use to
> track data for our clients. Each table has an identical structure, and
> holds the data for one client.

I'd list what's wrong with this structure, but frankly it would take me long enough that I'd need a consulting fee. Suffice it to say that the above is a very, very bad (or at least antiquated) design idea and you need to transition out of it as soon as possible.

> Another idea that we are considering is one big table instead of 3,000
> smaller ones. We could simply add a numeric field to indicate which
> client a particular record was for.

Yes. Absolutely. Although I'd suggest an Integer field.

> Each table has between 500 and 50,000 records, so the big table could
> have up to 10 million rows if we combined everything.

Sure.

> A query on our current system is (for client #4):
>
>     Select * from client_4 where foo=2;
>
> A query from the new, proposed system would be:
>
>     Select * from big_results where client=4 and foo=2;
>
> The big question is, WHICH WILL BE FASTER with Postgres? Is there any
> performance improvement or cost to switching to this new structure?

Oh, no question query 1 will be faster ... FOR THAT QUERY. You are asking the wrong question.

However, explain to me how, under the current system, you can find the client who ordered $3000 worth of widgets on January 12th if you don't already know who it is? I'm not sure a 3000-table UNION query is even *possible*. Or how about giving me the average number of customer transactions in a month, across all clients?

<rant>

You've enslaved your application design to performance considerations ... an approach which was valid in 1990, because processing power was so limited then. But now that dual-processor servers with RAID can be had for less than $3000, there's simply no excuse for violating the principles of good relational database design just to speed up a query. Buying more RAM is much cheaper than having an engineer spend 3 weeks fixing data integrity problems.

The proper way to go about application design is to build your application on paper or in a modelling program according to the best principles of software design available, and *then* to discuss performance issues -- addressing them *first* by buying hardware, and only compromising your application design when no other alternative is available.

</rant>

I strongly suggest that you purchase Pascal's "Practical Issues in Database Design" and give it a read.

--
Josh Berkus
Aglio Database Solutions
San Francisco
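For illustration, the cross-client questions Josh raises become single queries against the combined table, where the 3,000-table layout would need a query per table or an enormous UNION. A sketch of the "average transactions per month" report, assuming a timestamp column; created_at is invented here:

    -- transactions per client per month, then the average across clients
    SELECT avg(transactions) AS avg_monthly_transactions
    FROM (
        SELECT client, date_trunc('month', created_at) AS month, count(*) AS transactions
        FROM big_results
        GROUP BY client, date_trunc('month', created_at)
    ) AS per_client_month;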
On Thu, Jan 30, 2003 at 12:34:36PM -0500, Noah Silverman wrote:
> Select * from client_4 where foo=2;
>
> A query from the new, proposed system would be
>
> Select * from big_results where client=4 and foo=2;
>
> The big question is, WHICH WILL BE FASTER with Postgres? Is there any
> performance improvement or cost to switching to this new structure?

Faster overall, or faster for that operation? I can't prove it, but I suspect that the first one will return faster just because both the index and the table itself are smaller.

The possibility is that it will cause you problems overall, however, because of the large number of files you have to keep if you use 3000 tables. This is dependent on your filesystem (and its implementation).

Note, too, that a lot of transactions frequently updating the table might make a difference. A large number of dead tuples sitting in a 10 million row table will make anything crawl.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
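A rough illustration of the upkeep Andrew is pointing at: with heavy update churn on one 10-million-row table, routine vacuuming is what keeps dead tuples from dragging every query down (the table name follows the earlier sketches):

    -- reclaim space held by dead tuples and refresh planner statistics
    VACUUM ANALYZE big_results;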
OK,

Thanks for the quick responses.

A bit more information.

We are in the business of gathering data for our clients. (We're a news service). Consequently, we do a lot of inserting and very rarely do any deleting. (We periodically clear out results that are over 6 months old.)

On a given day, we will insert around 100,000 records in total. (Currently split across all the client tables).

A challenging part of the process is that we have to keep track of previous content that may be similar. We CAN'T do this with a unique index (don't ask, it would take too long to explain, but trust me, it isn't possible). So, we have to query the table first and then compare the results of that query to what we are inserting. SO, we probably do close to 1 million queries, but then only make about 100,000 inserts. The basic flow is 1) our system finds something it likes, 2) query the table to see if something similar already exists, 3) if nothing similar exists, insert.

While all this is going on, our clients are accessing our online reporting system. This system makes a variety of count and record requests from the database.

As I mentioned in our earlier post, we are attempting to decide if Postgres will run faster/better with one big table, or a bunch of smaller ones. It really doesn't make much difference to us, we just want whatever structure will be faster.

Thanks,

-N
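A sketch of that check-then-insert flow against the single-table layout; the similarity comparison itself happens in the application, and the column names here (id, headline, published_on) are invented for illustration:

    -- step 2: fetch candidate rows the application will compare against
    SELECT id, headline
    FROM big_results
    WHERE client = 4
      AND published_on > current_date - 180;  -- only the last ~6 months are kept anyway

    -- step 3: if the application finds nothing similar, insert the new record
    INSERT INTO big_results (client, headline, published_on)
    VALUES (4, 'Example headline text', current_date);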
On Thu, 30 Jan 2003, Josh Berkus wrote:

> The proper way to go about application design is to build your application on
> paper or in a modelling program according to the best principles of software
> design available, and *then* to discuss performance issues -- addressing them
> *first* by buying hardware, and only compromising your application design when
> no other alternative is available.

App design & performance go hand-in-hand. The trick is to balance them. Who wants a _wonderful_ design that runs like a piece of poo? In this case I agree with you - not the best design around.

Buying hardware to fix speed problems is useful, but the software side should not be neglected - imagine this scenario using your methods (with a wonderful pg performance problem in hand (unless you are running cvs)):

User has a schema and writes a query along the lines of

    select somevalue from sometable where othervalue not in
        (select badvalues from badvaluetable where id = 12345)

We all know this runs horrifically on postgres. Using your method I should go out, spend thousands on multi-processor boxes, raid, ram. If you do a little app tuning (maybe spend 10-30 minutes reading pgsql archives) you'll learn to rewrite it as an exists query and make it faster than it ever could have been on the fast hardware. I just saved the company $10k too! (Depends on if you consider that change a design change.)

Some designs are fatally flawed from the start. But hey.. oh well. 'Tis a fine line though.. balancing hardware vs software optimization. (I'm also guessing they are not constrained by things such as an embedded system too.)

------------------------------------------------------------------------------
Jeff Trout <jeff@jefftrout.com>          http://www.jefftrout.com/
   Ronald McDonald, with the help of cheese soup,
       controls America from a secret volkswagon hidden in the past
-------------------------------------------------------------------------------
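The rewrite Jeff is referring to, sketched with his placeholder names. On the PostgreSQL releases of that era, NOT IN over a subselect was notoriously slow, and the NOT EXISTS form (equivalent as long as badvalues is never NULL) planned much better:

    -- slow form on old PostgreSQL releases
    select somevalue from sometable
    where othervalue not in
        (select badvalues from badvaluetable where id = 12345);

    -- rewritten with NOT EXISTS
    select somevalue from sometable t
    where not exists
        (select 1 from badvaluetable b
         where b.id = 12345 and b.badvalues = t.othervalue);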
I'm going to go against the grain here and say that if you already have all of the code and schema worked out, you probably should stick with the many-table design. While there are many reasons you'd be better off with the one big table design, a speed increase really isn't one of them.

If you were starting from scratch, or even had a slew of development work you were planning to do, I'd probably recommend the one big table approach, but if you don't have any bottlenecks in your current system and the type of query you've given is typical of the majority of what your application is doing, there's no sense redesigning your application in the middle of a database switch.

Robert Treat

PS. Josh, are you referring to Pascal's "Practical Issues In Database Management" book or does he have a different book out that I'm not aware of?

On Thu, 2003-01-30 at 13:24, Noah Silverman wrote:
> [...]
I have a database running on PostgreSQL w/ close to 7 million records in one table and ~3 million in another, along w/ various smaller supportive tables. Before I started here everything was run out of small tables, one for each client, similar (I think) to what you are doing now. We submit ~50-100K records each week.

Before we moved to one table, our company had no idea of how it was doing on a daily, weekly or monthly basis. Now that we have moved to one large structure, new ideas about reporting functions and added services we can give to our clients are popping up all the time.

There are MANY benefits to following Josh's advice and putting all your information in one table. Other than those given, what if you wanted to give an added service to your clients where they are made aware of similar postings by your other clients? Running this kind of report would be a nightmare in your current situation.

As far as performance goes, I am able to join these 2 tables along w/ others and get the information, counts etc. that I need, using some rather complicated queries, in about 2-3 seconds per query. While this may sound awful, realize that I'm running on a standard workstation PIII 700, and for the money, it's a dream!

More importantly, you need to realize, as my coworkers have now done, that anything that you can do w/ a small table, you can do w/ one big table and an extra line in the where clause (e.g. Where client_id = 'blah'). PostgreSQL has wonderful support and many excellent DBAs; if you post a SQL problem they are very supportive in helping solve it.

I hope this helps make your decision.

Thanks

Chad

----- Original Message -----
From: "Noah Silverman" <noah@allresearch.com>
To: <pgsql-performance@postgresql.org>
Cc: <pgsql-performance@postgresql.org>
Sent: Thursday, January 30, 2003 11:24 AM
Subject: Re: [PERFORM] One large v. many small

> [...]
On Thu, 30 Jan 2003, Josh Berkus wrote:

> > Another idea that we are considering is one big table instead of 3,000
> > smaller ones. We could simply add a numeric field to indicate which
> > client a particular record was for.
>
> Yes. Absolutely. Although I'd suggest an Integer field.

From the description given in Noah's message, and also the one given in his later message, I have little doubt that 3000 small tables are going to be significantly faster than one large table. If you don't believe me, work out just where the disk blocks are going to end up, and how many blocks are going to have to be fetched for his typical query in a semi-clustered or non-clustered table. (If postgres had clustered indexes a la MS SQL Server, where the rows are physically stored in the order of the clustered index, it would be a different matter.)

> However, explain to me how, under the current system, you can find the client
> who ordered $3000 worth of widgets on January 12th if you don't already know
> who it is?

Explain to me why he has to do this. It's all very nice to have a general system that can do well on all sorts of queries, but if you lose time on the queries you do do, in order to save time on queries you don't do, you're definitely not getting the best performance out of the system.

> I'm not sure a 3000-table UNION query is even *possible*.

This is not the only solution, either. You could simply just do 3000 queries. If this is something you execute only once a month, then making that query three or four orders of magnitude more expensive might be a small price to pay for making cheaper the queries you run several times per second.

> <rant>
>
> You've enslaved your application design to performance considerations ... an
> approach which was valid in 1990, because processing power was so limited
> then. But now that dual-processor servers with RAID can be had for less than
> $3000, there's simply no excuse for violating the principles of good
> relational database design just to speed up a query. Buying more RAM is
> much cheaper than having an engineer spend 3 weeks fixing data integrity
> problems.

*Sigh.* Ok, my turn to rant.

RAM is not cheap enough yet for me to buy several hundred gigabytes of it for typical applications, even if I could find a server that I could put it in. Disk performance is not growing the way CPU performance is. And three weeks of engineering time plus a ten thousand dollar server is, even at my top billing rate, still a heck of a lot cheaper than a quarter-million dollar server.

Applying your strategy to all situations is not always going to produce the most cost-effective solution. And for most businesses, that's what it's all about. They're not interested in the more "theoretically pure" way of doing things except insofar as it makes them money.

As for the data integrity problems, I don't know where that came from. I think that was made up out of whole cloth, because it didn't seem to me that the original question involved any.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
Noah,

Well, there you have it: a unanimous consensus of opinion. You should either combine all of your tables or not. But definitely one or the other.

<grin>

Hope you feel informed now.

-Josh
On Thursday 30 Jan 2003 11:54 pm, you wrote:
> As I mentioned in our earlier post, we are attempting to decide if
> Postgres will run faster/better with one big table, or a bunch of
> smaller ones. It really doesn't make much difference to us, we just
> want whatever structure will be faster.

I would say create a big table with a client id. Create an index on it and create 3000 views. Of course you need to figure out SQL voodoo to insert into postgresql views using rules. But that would save you from modifying your app up and down.

But there is going to be massive fragmentation. Consider clustering tables once in a while.

HTH

Shridhar
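A rough sketch of the per-client view plus insert rule Shridhar has in mind, reusing the illustrative table and column names from earlier in the thread:

    -- one view per client, so existing per-client queries keep working unchanged
    CREATE VIEW client_4 AS
        SELECT * FROM big_results WHERE client = 4;

    -- the "SQL voodoo": route inserts against the view into the big table
    CREATE RULE client_4_insert AS
        ON INSERT TO client_4 DO INSTEAD
        INSERT INTO big_results (client, foo, payload)
        VALUES (4, NEW.foo, NEW.payload);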
Folks,

Many, many replies on this topic:

Jeff:
> App design & performance go hand-in-hand. The trick is to balance them.
> Who wants a _wonderful_ design that runs like a piece of poo? In this
<snip>
> select somevalue from sometable where othervalue not in (select badvalues
> from badvaluetable where id = 12345)
>
> We all know this runs horrifically on postgres. Using your method I should
> go out, spend thousands on multi-processor boxes, raid, ram

Sorry, no, Jeff. The above is what one calls a "bad query" and is not, therefore, a performance vs. design issue: that query is bad design-wise, and bad performance-wise. Perhaps another example of your argument?

Since you do not seem to have understood my argument, it is this: Design changes, made for the sake of performance or rapid app building, which completely violate good RDBMS design and normalization principles, almost always cost you more over the life of the application than you gain in performance in the short term.

Curt:
> It's all very nice to have a general system that can do well on all
> sorts of queries, but if you lose time on the queries you do do, in
> order to save time on queries you don't do, you're definitely not
> getting the best performance out of the system.

This is a good point; I tend to build for posterity because, so far, 90% of my clients who started out having me build a "single-purpose" database ended up expanding the application to cover 2-10 additional needs, thus forcing me to clean up any design shortcuts I took with the original app. However, Noah may have more control over his company than that.

<and>
> RAM is not cheap enough yet for me to buy several hundred gigabytes of
> it for typical applications, even if I could find a server that I could
> put it in. Disk performance is not growing the way CPU performance is.
> And three weeks of engineering time plus a ten thousand dollar server
> is, even at my top billing rate, still a heck of a lot cheaper than a
> quarter-million dollar server.

I was thinking more of the difference between a $3000 server and a $9000 server myself; unless you're doing nuclear test modelling, I don't see any need for a $250,000 server for anything.

To give an extreme example, I have a client who purchased a $150,000 accounting system that turned out to be badly designed, normalization-wise, partly because the accounting system engineers were focusing on 8-year-old technology with performance restrictions which were no longer really applicable (for example, they talked the client into buying a quad-processor server and then wrote all of their middleware code on a platform that does not do SMP). Over the last two years, they have paid my company $175,000 to "fix" this accounting database ... more, in fact, than I would have charged them to write a better system from scratch.

<and>
> Applying your strategy to all situations is not always going to produce
> the most cost-effective solution.

That's very true. In fact, that could be taken as a "general truism" ... no one strategy applies to *all* situations.

Robert:
> PS. Josh, are you referring to Pascal's "Practical Issues In Database
> Management" book or does he have a different book out that I'm not aware
> of?

Yes, you're correct. Sorry!

--
Josh Berkus
Aglio Database Solutions
San Francisco
Curt Sampson wrote:
> From the description given in Noah's message, and also the
> one given in his later message, I have little doubt that 3000
> small tables are going to be significantly faster than one
> large table. If you don't believe me, work out just where the
> disk blocks are going to end up, and how many blocks are going
> to have to be fetched for his typical query in a semi-clustered or
> non-clustered table.

You may be right, Curt, but I've seen unintuitive results for this kind of thing in the past. Depending on the way the records are accessed and the cache size, the exact opposite could be true. The index pages will most likely rarely be in memory when you have 3000 different tables. Meaning that each search will require at least three or four index page retrievals plus the tuple page. So what you might lose due to lack of clustering will be made up by the more efficient caching of the upper levels of the index btree pages.

Combine that with a multi-part index (on both client and foo; which order would depend on the access required) that is clustered once a week or so using the admittedly non-optimal PostgreSQL CLUSTER command, and I'll bet you can get equivalent or better performance with the single table, with the concomitant advantages of much better reporting options.

I've also seen many examples of linear algorithms in database data dictionaries which would cause a 3000+ table database to perform poorly during the parsing/query optimization stage. I don't have any idea whether or not PostgreSQL suffers from this problem.

I don't think there is any substitute for just trying it out. It shouldn't be that hard to create a bunch of SQL statements that concatenate the tables into one large one. Try the most common queries against both scenarios. You might be surprised.

- Curtis
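A sketch of the weekly maintenance Curtis is suggesting, assuming the composite (client, foo) index from the earlier sketch already exists. (On the PostgreSQL releases of that era the syntax was CLUSTER indexname ON tablename; newer releases spell it CLUSTER tablename USING indexname.)

    -- rewrite the table in (client, foo) index order; takes an exclusive lock while it runs
    CLUSTER big_results_client_foo_idx ON big_results;

    -- refresh planner statistics afterwards
    ANALYZE big_results;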
On Fri, 31 Jan 2003, Curtis Faith wrote:

> Depending on the way the records are accessed and the cache size,
> the exact opposite could be true. The index pages will most likely
> rarely be in memory when you have 3000 different tables. Meaning
> that each search will require at least three or four index page
> retrievals plus the tuple page.

Assuming you're using indexes at all. If you're tending to use table scans, this doesn't matter. From Noah's description it seemed he was--he said that a particular data item couldn't be the primary key, presumably because he couldn't index it reasonably. But this is just my guess, not a fact.

> Combine that with a multi-part index (on both client and foo; which order
> would depend on the access required) that is clustered once a week
> or so using the admittedly non-optimal PostgreSQL CLUSTER command,
> and I'll bet you can get equivalent or better performance...

I would say that, just after a CLUSTER, you're likely to see better performance because this would have the effect, on an FFS or similar filesystem where you've got plenty of free space, of physically clustering data that would not have been clustered in the case of a lot of small tables that see a lot of appending evenly over all of them over the course of time.

So the tradeoff there is really, can you afford the time for the CLUSTER? (In a system where you have a lot of maintenance time, probably. Though if it's a huge table, this might need an entire weekend. In a system that needs to be up 24/7, probably not, unless you have lots of spare I/O capacity.) Just out of curiosity, how does CLUSTER deal with updates to the table while the CLUSTER command is running?

> I don't think there is any substitute for just trying it out. It
> shouldn't be that hard to create a bunch of SQL statements that
> concatenate the tables into one large one.

I entirely agree! There are too many unknowns here to do more than speculate on this list. But thanks for enlightening me on situations where one big table might perform better.

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
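A sketch of the kind of trial setup being agreed on here: fold the per-client tables into the combined table (in practice the per-client statements would be generated by a script) and compare the typical query under both layouts. Names are the illustrative ones used above.

    -- fold each per-client table into the combined table
    INSERT INTO big_results (client, foo, payload)
        SELECT 4, foo, payload FROM client_4;
    INSERT INTO big_results (client, foo, payload)
        SELECT 5, foo, payload FROM client_5;
    -- ...and so on for the remaining client tables...

    -- then compare the common query against both layouts
    EXPLAIN ANALYZE SELECT * FROM client_4 WHERE foo = 2;
    EXPLAIN ANALYZE SELECT * FROM big_results WHERE client = 4 AND foo = 2;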
Curt Sampson wrote:
> So the tradeoff there is really, can you afford the time for the CLUSTER?
> (In a system where you have a lot of maintenance time, probably. Though if
> it's a huge table, this might need an entire weekend. In a system that needs
> to be up 24/7, probably not, unless you have lots of spare I/O capacity.)
> Just out of curiosity, how does CLUSTER deal with updates to the table while
> the CLUSTER command is running?

CLUSTER locks the table, so no updates can happen during a cluster.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073