Home > mailing lists

Re: Database design wisdom needed - Mailing list pgsql-general

From	Ron Johnson
Subject	Re: Database design wisdom needed
Date	June 5, 2007 10:40:10
Msg-id	46656638.2070901@cox.net Whole thread Raw
In response to	Database design wisdom needed ("Erick Papadakis" <erick.papa@gmail.com>)
List	pgsql-general

Tree view

On 06/05/07 00:46, Erick Papadakis wrote:
> Hi
>
> Sorry for this somewhat long email but I think it is relevant to most
> people who run online databases. I am having trouble optimizing UPDATE
> queries on a certain semi-large table that is only growing larger.
> I've come across some very interesting thoughts from this list, so I
> thought I'll post my conundrum here. Would truly appreciate any
> guidance or pointers.
>
> I have done the messy DBA's job of breaking one table into three,
> which should actually have been one. Just did it to make sure the
> large main table doesn't keep getting updated with small crumbs of
> information.
>
> This main table stores what users of our online service have created.
> These are small Flash games, but over the last seven years, we have
> collected about 5 million of these. The primary key of this table is
> "GAME_ID". In our online account management interface I also need to
> show each producer's creations only to himself, so of course I need a
> "PRODUCER_ID".
>
> Above that, in our service we have had three more bits of information
> for each created game:
>
> - Number of views for the game (online views)
> - Number of plays of the game (online people playing)
> - Number of unique plays of the game (based on ip or
> subscriber_id..some games do not require login to play)
>
> Because each of these pieces of information was individual to each
> game, my original table looked like this:
>
>
>    GAME Table
>    -----------
>    GAME_ID
>    PRODUCER_ID
>    VIEWS_COUNT
>    PLAYED_COUNT
>    PLAYED_COUNT_UNIQUE
>
>
> In the beginning, everytime someone played a game online, we updated
> the PLAYED_COUNT in this table. When someone viewed a game, we updated
> the VIEWS_COUNT. It was simple and it worked as it reflected the
> business logic.
>
> Over time, however, I realized that while GAME_ID and PRODUCER_ID
> remained basically static, the next three columns had a very high rate
> of change. For example, VIEWS_COUNT would increment every second for
> popular games. The PLAYED_COUNT would increment everytime someone
> played, but the fact that we have "unique" into the equation means
> that I had to break this away into a log table, so that I could GROUP
> BY ip address and then update the columns accordingly in the GAME
> table.  So I had:
>
>
>    GAME_PLAYED_LOG Table
>    ----------------------
>    GAME_ID
>    PLAYER_IP_ADDR
>
>
>  From here, I would select COUNT(PLAYER_IP_ADDR) and COUNT(DISTINCT
> PLAYER_IP_ADDR), grouped by GAME_ID.  I would then update the main
> GAME table's columns with these values. This process was done every
> hour to make sure we didn't have a huge number of rows to manipulate.
>
> Anyway, now we're realizing that we are becoming increasingly popular
> and our tables are large enough for us to require optimization.
> Because GAME table is mostly for static information, and is also
> required to be highly available because of our online interface, I
> have just taken away the VIEWS_COUNT and PLAYED_COUNT into another
> table.
>
>
>    GAME Table (~5 million rows)
>    ----------------------------
>    GAME_ID
>    PRODUCER_ID
>
>
>    GAME_COUNTS Table (also ~5 million rows of course)
>    ---------------------------------------------------
>    GAME_ID
>    VIEWS_COUNT
>    PLAYED_COUNT
>    PLAYED_COUNT_UNIQUE
>
>
> This way, from our GAME_PLAYED_LOG, we do not need to update the main
> GAME table every hour, but only the GAME_COUNTS. This leaves the GAME
> table free to do its job. That is our main table, so keeping it static
> was our priority.

And good database theory, too, since the number of times that a game
is played is not "tied" to the PRODUCER_ID.

<golf clap>

> My problems:
>
> 1. Because we are a popular website, the GAME_PLAYED_LOG table grows
> at massive rates every hour. About 1,000,000 records. Doing COUNT
> queries on this table already is pretty resource hogging, even if we
> do them every hour. Especially the DISTINCT grouping to get unique
> played count.

Any index on GAME_PLAYED_LOG?

> 1.5. After we update the GAME_COUNTS table, we also truncate the
> GAME_PLAYED_LOG table because its records have no significance
> anymore. This hourly deletion leads to fragmentation etc in the table.
>
> 2. The UPDATE queries to GAME_COUNTS table are also a problem. The
> sheer number of UPDATE queries is massive. Of the 1,000,000 views
> every hour, many are just one or two views of many, many games
> (remember, we have 5,000,000 games). So every hour we end up running
> thousands of small UPDATE queries like:
>
>   update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3, PLAYED_COUNT + 1...
>
> 3. Finally, the JOIN queries between the GAME table and GAME_COUNTS
> table are not very fast. They take about 1 second each, even if I do a
> LIMIT 20 in every query.
>
>
> Now I suspect this scenario is *very* common in online databases that
> must report viewer statistics. How are you guys doing it? What am I
> missing? Isn't the decoupling of the static information from the more
> frequently updated information a good idea? Would using a different
> storage engine for different kinds of tables help -- engines that were
> better for INSERT and SELECT, while others that were good for UPDATE?
> We ran MySQL until a year ago but we have switched to Pg since we had
> data corruption issues. I am sure I'm missing some tricks in the Pg
> world, and would truly appreciate any ideas.
>
> If you are reading until here, well done! And so many thanks in
> advance for any insight you can shed into this matter.

What index(es) is/are on GAME and GAME_COUNTS?

What version of PG are you running?

Are you pushing the box too hard?

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!

pgsql-general by date:

From: Erwin Brandstetter
Date: 05 June 2007, 10:39:38
Subject: Re: There can be only one! How to avoid the "highlander-problem".

From: Erwin Brandstetter
Date: 05 June 2007, 10:47:37
Subject: Re: There can be only one! How to avoid the "highlander-problem".

Re: Database design wisdom needed - Mailing list pgsql-general

Previous

Next