Thread: Database design wisdom needed
Hi,

Sorry for this somewhat long email, but I think it is relevant to most
people who run online databases. I am having trouble optimizing UPDATE
queries on a certain semi-large table that is only growing larger. I've
come across some very interesting thoughts on this list, so I thought
I'd post my conundrum here. I would truly appreciate any guidance or
pointers.

I have done the messy DBA's job of breaking into three tables what
should really have been one. I did it to make sure the large main table
doesn't keep getting updated with small crumbs of information.

This main table stores what users of our online service have created.
These are small Flash games, and over the last seven years we have
collected about 5 million of them. The primary key of this table is
"GAME_ID". In our online account management interface I also need to
show each producer only his own creations, so of course I need a
"PRODUCER_ID".

On top of that, our service tracks three more bits of information for
each created game:

- Number of views of the game (online views)
- Number of plays of the game (online people playing)
- Number of unique plays of the game (based on IP or subscriber_id;
  some games do not require login to play)

Because each of these pieces of information is specific to each game,
my original table looked like this:

GAME Table
-----------
GAME_ID
PRODUCER_ID
VIEWS_COUNT
PLAYED_COUNT
PLAYED_COUNT_UNIQUE

In the beginning, every time someone played a game online, we updated
the PLAYED_COUNT in this table. When someone viewed a game, we updated
the VIEWS_COUNT. It was simple, and it worked because it reflected the
business logic.

Over time, however, I realized that while GAME_ID and PRODUCER_ID
remained basically static, the next three columns had a very high rate
of change. For example, VIEWS_COUNT would increment every second for
popular games. PLAYED_COUNT would increment every time someone played,
but the "unique" requirement meant I had to break this out into a log
table, so that I could GROUP BY IP address and then update the columns
in the GAME table accordingly. So I had:

GAME_PLAYED_LOG Table
----------------------
GAME_ID
PLAYER_IP_ADDR

From here, I would select COUNT(PLAYER_IP_ADDR) and COUNT(DISTINCT
PLAYER_IP_ADDR), grouped by GAME_ID, and then update the main GAME
table's columns with those values. This process was done every hour to
make sure we didn't have a huge number of rows to manipulate.
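In simplified form (this is not our exact production SQL, and the views
log is handled the same way from a similar log table), the hourly pass
boils down to something like:

    -- hourly roll-up, simplified sketch
    UPDATE GAME AS g
       SET PLAYED_COUNT        = g.PLAYED_COUNT + s.plays,
           PLAYED_COUNT_UNIQUE = g.PLAYED_COUNT_UNIQUE + s.unique_plays
      FROM (SELECT GAME_ID,
                   COUNT(PLAYER_IP_ADDR)          AS plays,
                   COUNT(DISTINCT PLAYER_IP_ADDR) AS unique_plays
              FROM GAME_PLAYED_LOG
             GROUP BY GAME_ID) AS s
     WHERE g.GAME_ID = s.GAME_ID;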
Anyway, now we're realizing that we are becoming increasingly popular
and our tables are large enough to require optimization. Because the
GAME table holds mostly static information, and also has to be highly
available for our online interface, I have just moved the count columns
out into another table:

GAME Table (~5 million rows)
----------------------------
GAME_ID
PRODUCER_ID

GAME_COUNTS Table (also ~5 million rows, of course)
---------------------------------------------------
GAME_ID
VIEWS_COUNT
PLAYED_COUNT
PLAYED_COUNT_UNIQUE

This way, the hourly job driven by GAME_PLAYED_LOG no longer has to
update the main GAME table, only GAME_COUNTS. This leaves the GAME
table free to do its job. It is our main table, so keeping it static
was our priority.

My problems:

1. Because we are a popular website, the GAME_PLAYED_LOG table grows at
a massive rate -- about 1,000,000 records every hour. COUNT queries on
this table are already quite resource-hogging, even though we run them
only once an hour. The DISTINCT grouping for the unique played count is
especially heavy.

1.5. After we update the GAME_COUNTS table, we also truncate the
GAME_PLAYED_LOG table, because its records have no significance
anymore. This hourly deletion leads to fragmentation and the like in
the table.

2. The UPDATE queries against the GAME_COUNTS table are also a problem.
The sheer number of them is massive. Of the 1,000,000 views every hour,
many are just one or two views spread over many, many games (remember,
we have 5,000,000 games). So every hour we end up running thousands of
small UPDATE queries like:

update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3,
       PLAYED_COUNT = PLAYED_COUNT + 1 ...

3. Finally, the JOIN queries between the GAME table and the GAME_COUNTS
table are not very fast. They take about 1 second each, even if I put a
LIMIT 20 in every query.

Now, I suspect this scenario is *very* common in online databases that
must report viewer statistics. How are you guys doing it? What am I
missing? Isn't decoupling the static information from the more
frequently updated information a good idea? Would using a different
storage engine for different kinds of tables help -- engines that are
better for INSERT and SELECT, and others that are good for UPDATE? We
ran MySQL until a year ago, but we switched to Pg after running into
data corruption issues. I am sure I'm missing some tricks in the Pg
world, and would truly appreciate any ideas.

If you have read this far, well done! And many thanks in advance for
any insight you can shed on this matter.

.EP
On 06/05/07 00:46, Erick Papadakis wrote:
> [snip]
>
> This way, the hourly job driven by GAME_PLAYED_LOG no longer has to
> update the main GAME table, only GAME_COUNTS. This leaves the GAME
> table free to do its job. It is our main table, so keeping it static
> was our priority.

And good database theory, too, since the number of times that a game is
played is not "tied" to the PRODUCER_ID. <golf clap>

> 1. Because we are a popular website, the GAME_PLAYED_LOG table grows
> at a massive rate -- about 1,000,000 records every hour. COUNT queries
> on this table are already quite resource-hogging, even though we run
> them only once an hour. The DISTINCT grouping for the unique played
> count is especially heavy.

Any index on GAME_PLAYED_LOG?
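If not, the obvious candidate would be something along these lines
(index name purely illustrative), since the hourly GROUP BY is on
GAME_ID:

    CREATE INDEX game_played_log_game_id_idx
        ON GAME_PLAYED_LOG (GAME_ID);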
> [snip problems 1.5 through 3 and the closing questions]

What index(es) is/are on GAME and GAME_COUNTS?

What version of PG are you running?

Are you pushing the box too hard?

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


On Tue, Jun 05, 2007 at 01:46:42PM +0800, Erick Papadakis wrote:
> GAME_COUNTS Table (also ~5 million rows, of course)
> ---------------------------------------------------
> GAME_ID
> VIEWS_COUNT
> PLAYED_COUNT
> PLAYED_COUNT_UNIQUE

This is a poor normalisation. While views_count is necessarily a
superset of played_count, the two values are not really functions of
one another (for instance, there will be games that have been viewed
but never played); therefore I'd say they should be in separate tables.

I'd also like to point out that a 5 million row table is actually not
very big.

In order to make this fast, I'd probably implement a dirtyish hack
along the following lines:

create table game_views_summ (
    game_id    int references games(id),
    views      bigint,
    counted_at timestamp with time zone);

create table game_views (
    game_id    int references games(id),
    viewed_at  timestamp with time zone);

Then you have a daemon that regularly summarises the data in game_views
into game_views_summ and deletes the rows in game_views that have just
been rolled up. If you do this more or less all the time, you can keep
the vacuums up to date and keep the bloat to a minimum. It's an
increase in overall I/O, but it ought to be smoother than trying to
cope with it in big lumps. (A similar strategy will work for the play
counts.)
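A single pass of that daemon could be roughly this (untested sketch; it
assumes viewed_at is filled in when the row is inserted, and the
one-minute cutoff is arbitrary):

    begin;

    -- the same cutoff bounds both statements, so the rows deleted are
    -- exactly the rows just summarised
    insert into game_views_summ (game_id, views, counted_at)
    select game_id, count(*), now()
      from game_views
     where viewed_at < now() - interval '1 minute'
     group by game_id;

    delete from game_views
     where viewed_at < now() - interval '1 minute';

    commit;

    -- the total view count for a game is then sum(views) over its
    -- rows in game_views_summ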
One thing to be careful about is that you don't end up with game_views
holding very few live rows but a huge number of dead ones. This
two-table approach can rapidly become a pessimal implementation if you
are too efficient at eliminating the rows-to-be-summarised, but have a
lot of dead rows that are unrecoverable because of running
transactions. You'll end up with a seqscan on a table that contains
four live rows, except that it's 400M. That pattern is a killer.

> 1.5. After we update the GAME_COUNTS table, we also truncate the
> GAME_PLAYED_LOG table, because its records have no significance
> anymore. This hourly deletion leads to fragmentation and the like in
> the table.

Are you truncating, or deleting? TRUNCATE leaves no dead rows,
fragmentation, &c. That's its point.

> (remember, we have 5,000,000 games). So every hour we end up running
> thousands of small UPDATE queries like:
>
> update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3,
>        PLAYED_COUNT = PLAYED_COUNT + 1 ...

Why is this bad? (You don't actually need thousands of these, I think,
because you ought to be able to design one query to do it all. But I'd
be unhappy with the locking, I think, given what you're trying to do.)

> 3. Finally, the JOIN queries between the GAME table and the
> GAME_COUNTS table are not very fast. They take about 1 second each,
> even if I put a LIMIT 20 in every query.

This sounds like something's wrong in your query or your plan. EXPLAIN
ANALYSE is your friend here. I'd suspect vacuum issues.

Oh, one other thing. I noted you're storing the player's IP address.
You do know that maps very poorly to actual individuals on the other
end, right?

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful than
reformatting your hard drive when it gets out of whack.
        --Scott Morris