Re: Database design wisdom needed - Mailing list pgsql-general
From | Ron Johnson |
---|---|
Subject | Re: Database design wisdom needed |
Date | |
Msg-id | 46656638.2070901@cox.net Whole thread Raw |
In response to | Database design wisdom needed ("Erick Papadakis" <erick.papa@gmail.com>) |
List | pgsql-general |
On 06/05/07 00:46, Erick Papadakis wrote: > Hi > > Sorry for this somewhat long email but I think it is relevant to most > people who run online databases. I am having trouble optimizing UPDATE > queries on a certain semi-large table that is only growing larger. > I've come across some very interesting thoughts from this list, so I > thought I'll post my conundrum here. Would truly appreciate any > guidance or pointers. > > I have done the messy DBA's job of breaking one table into three, > which should actually have been one. Just did it to make sure the > large main table doesn't keep getting updated with small crumbs of > information. > > This main table stores what users of our online service have created. > These are small Flash games, but over the last seven years, we have > collected about 5 million of these. The primary key of this table is > "GAME_ID". In our online account management interface I also need to > show each producer's creations only to himself, so of course I need a > "PRODUCER_ID". > > Above that, in our service we have had three more bits of information > for each created game: > > - Number of views for the game (online views) > - Number of plays of the game (online people playing) > - Number of unique plays of the game (based on ip or > subscriber_id..some games do not require login to play) > > Because each of these pieces of information was individual to each > game, my original table looked like this: > > > GAME Table > ----------- > GAME_ID > PRODUCER_ID > VIEWS_COUNT > PLAYED_COUNT > PLAYED_COUNT_UNIQUE > > > In the beginning, everytime someone played a game online, we updated > the PLAYED_COUNT in this table. When someone viewed a game, we updated > the VIEWS_COUNT. It was simple and it worked as it reflected the > business logic. > > Over time, however, I realized that while GAME_ID and PRODUCER_ID > remained basically static, the next three columns had a very high rate > of change. For example, VIEWS_COUNT would increment every second for > popular games. The PLAYED_COUNT would increment everytime someone > played, but the fact that we have "unique" into the equation means > that I had to break this away into a log table, so that I could GROUP > BY ip address and then update the columns accordingly in the GAME > table. So I had: > > > GAME_PLAYED_LOG Table > ---------------------- > GAME_ID > PLAYER_IP_ADDR > > > From here, I would select COUNT(PLAYER_IP_ADDR) and COUNT(DISTINCT > PLAYER_IP_ADDR), grouped by GAME_ID. I would then update the main > GAME table's columns with these values. This process was done every > hour to make sure we didn't have a huge number of rows to manipulate. > > Anyway, now we're realizing that we are becoming increasingly popular > and our tables are large enough for us to require optimization. > Because GAME table is mostly for static information, and is also > required to be highly available because of our online interface, I > have just taken away the VIEWS_COUNT and PLAYED_COUNT into another > table. > > > GAME Table (~5 million rows) > ---------------------------- > GAME_ID > PRODUCER_ID > > > GAME_COUNTS Table (also ~5 million rows of course) > --------------------------------------------------- > GAME_ID > VIEWS_COUNT > PLAYED_COUNT > PLAYED_COUNT_UNIQUE > > > This way, from our GAME_PLAYED_LOG, we do not need to update the main > GAME table every hour, but only the GAME_COUNTS. This leaves the GAME > table free to do its job. That is our main table, so keeping it static > was our priority. And good database theory, too, since the number of times that a game is played is not "tied" to the PRODUCER_ID. <golf clap> > My problems: > > 1. Because we are a popular website, the GAME_PLAYED_LOG table grows > at massive rates every hour. About 1,000,000 records. Doing COUNT > queries on this table already is pretty resource hogging, even if we > do them every hour. Especially the DISTINCT grouping to get unique > played count. Any index on GAME_PLAYED_LOG? > 1.5. After we update the GAME_COUNTS table, we also truncate the > GAME_PLAYED_LOG table because its records have no significance > anymore. This hourly deletion leads to fragmentation etc in the table. > > 2. The UPDATE queries to GAME_COUNTS table are also a problem. The > sheer number of UPDATE queries is massive. Of the 1,000,000 views > every hour, many are just one or two views of many, many games > (remember, we have 5,000,000 games). So every hour we end up running > thousands of small UPDATE queries like: > > update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3, PLAYED_COUNT + 1... > > 3. Finally, the JOIN queries between the GAME table and GAME_COUNTS > table are not very fast. They take about 1 second each, even if I do a > LIMIT 20 in every query. > > > Now I suspect this scenario is *very* common in online databases that > must report viewer statistics. How are you guys doing it? What am I > missing? Isn't the decoupling of the static information from the more > frequently updated information a good idea? Would using a different > storage engine for different kinds of tables help -- engines that were > better for INSERT and SELECT, while others that were good for UPDATE? > We ran MySQL until a year ago but we have switched to Pg since we had > data corruption issues. I am sure I'm missing some tricks in the Pg > world, and would truly appreciate any ideas. > > If you are reading until here, well done! And so many thanks in > advance for any insight you can shed into this matter. What index(es) is/are on GAME and GAME_COUNTS? What version of PG are you running? Are you pushing the box too hard? -- Ron Johnson, Jr. Jefferson LA USA Give a man a fish, and he eats for a day. Hit him with a fish, and he goes away for good!
pgsql-general by date: