Thread: Database design wisdom needed

Database design wisdom needed

From
"Erick Papadakis"
Date:
Hi

Sorry for this somewhat long email but I think it is relevant to most
people who run online databases. I am having trouble optimizing UPDATE
queries on a certain semi-large table that is only growing larger.
I've come across some very interesting thoughts from this list, so I
thought I'll post my conundrum here. Would truly appreciate any
guidance or pointers.

I have done the messy DBA's job of breaking one table into three,
which should actually have been one. Just did it to make sure the
large main table doesn't keep getting updated with small crumbs of
information.

This main table stores what users of our online service have created.
These are small Flash games, but over the last seven years, we have
collected about 5 million of these. The primary key of this table is
"GAME_ID". In our online account management interface I also need to
show each producer's creations only to himself, so of course I need a
"PRODUCER_ID".

Above that, in our service we have had three more bits of information
for each created game:

- Number of views for the game (online views)
- Number of plays of the game (online people playing)
- Number of unique plays of the game (based on ip or
subscriber_id..some games do not require login to play)

Because each of these pieces of information was individual to each
game, my original table looked like this:


    GAME Table
    -----------
    GAME_ID
    PRODUCER_ID
    VIEWS_COUNT
    PLAYED_COUNT
    PLAYED_COUNT_UNIQUE


In the beginning, everytime someone played a game online, we updated
the PLAYED_COUNT in this table. When someone viewed a game, we updated
the VIEWS_COUNT. It was simple and it worked as it reflected the
business logic.

Over time, however, I realized that while GAME_ID and PRODUCER_ID
remained basically static, the next three columns had a very high rate
of change. For example, VIEWS_COUNT would increment every second for
popular games. The PLAYED_COUNT would increment everytime someone
played, but the fact that we have "unique" into the equation means
that I had to break this away into a log table, so that I could GROUP
BY ip address and then update the columns accordingly in the GAME
table.  So I had:


    GAME_PLAYED_LOG Table
    ----------------------
    GAME_ID
    PLAYER_IP_ADDR


From here, I would select COUNT(PLAYER_IP_ADDR) and COUNT(DISTINCT
PLAYER_IP_ADDR), grouped by GAME_ID.  I would then update the main
GAME table's columns with these values. This process was done every
hour to make sure we didn't have a huge number of rows to manipulate.

Anyway, now we're realizing that we are becoming increasingly popular
and our tables are large enough for us to require optimization.
Because GAME table is mostly for static information, and is also
required to be highly available because of our online interface, I
have just taken away the VIEWS_COUNT and PLAYED_COUNT into another
table.


    GAME Table (~5 million rows)
    ----------------------------
    GAME_ID
    PRODUCER_ID


    GAME_COUNTS Table (also ~5 million rows of course)
    ---------------------------------------------------
    GAME_ID
    VIEWS_COUNT
    PLAYED_COUNT
    PLAYED_COUNT_UNIQUE


This way, from our GAME_PLAYED_LOG, we do not need to update the main
GAME table every hour, but only the GAME_COUNTS. This leaves the GAME
table free to do its job. That is our main table, so keeping it static
was our priority.

My problems:

1. Because we are a popular website, the GAME_PLAYED_LOG table grows
at massive rates every hour. About 1,000,000 records. Doing COUNT
queries on this table already is pretty resource hogging, even if we
do them every hour. Especially the DISTINCT grouping to get unique
played count.

1.5. After we update the GAME_COUNTS table, we also truncate the
GAME_PLAYED_LOG table because its records have no significance
anymore. This hourly deletion leads to fragmentation etc in the table.

2. The UPDATE queries to GAME_COUNTS table are also a problem. The
sheer number of UPDATE queries is massive. Of the 1,000,000 views
every hour, many are just one or two views of many, many games
(remember, we have 5,000,000 games). So every hour we end up running
thousands of small UPDATE queries like:

   update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3, PLAYED_COUNT + 1...

3. Finally, the JOIN queries between the GAME table and GAME_COUNTS
table are not very fast. They take about 1 second each, even if I do a
LIMIT 20 in every query.


Now I suspect this scenario is *very* common in online databases that
must report viewer statistics. How are you guys doing it? What am I
missing? Isn't the decoupling of the static information from the more
frequently updated information a good idea? Would using a different
storage engine for different kinds of tables help -- engines that were
better for INSERT and SELECT, while others that were good for UPDATE?
We ran MySQL until a year ago but we have switched to Pg since we had
data corruption issues. I am sure I'm missing some tricks in the Pg
world, and would truly appreciate any ideas.

If you are reading until here, well done! And so many thanks in
advance for any insight you can shed into this matter.

.EP

Re: Database design wisdom needed

From
Ron Johnson
Date:
On 06/05/07 00:46, Erick Papadakis wrote:
> Hi
>
> Sorry for this somewhat long email but I think it is relevant to most
> people who run online databases. I am having trouble optimizing UPDATE
> queries on a certain semi-large table that is only growing larger.
> I've come across some very interesting thoughts from this list, so I
> thought I'll post my conundrum here. Would truly appreciate any
> guidance or pointers.
>
> I have done the messy DBA's job of breaking one table into three,
> which should actually have been one. Just did it to make sure the
> large main table doesn't keep getting updated with small crumbs of
> information.
>
> This main table stores what users of our online service have created.
> These are small Flash games, but over the last seven years, we have
> collected about 5 million of these. The primary key of this table is
> "GAME_ID". In our online account management interface I also need to
> show each producer's creations only to himself, so of course I need a
> "PRODUCER_ID".
>
> Above that, in our service we have had three more bits of information
> for each created game:
>
> - Number of views for the game (online views)
> - Number of plays of the game (online people playing)
> - Number of unique plays of the game (based on ip or
> subscriber_id..some games do not require login to play)
>
> Because each of these pieces of information was individual to each
> game, my original table looked like this:
>
>
>    GAME Table
>    -----------
>    GAME_ID
>    PRODUCER_ID
>    VIEWS_COUNT
>    PLAYED_COUNT
>    PLAYED_COUNT_UNIQUE
>
>
> In the beginning, everytime someone played a game online, we updated
> the PLAYED_COUNT in this table. When someone viewed a game, we updated
> the VIEWS_COUNT. It was simple and it worked as it reflected the
> business logic.
>
> Over time, however, I realized that while GAME_ID and PRODUCER_ID
> remained basically static, the next three columns had a very high rate
> of change. For example, VIEWS_COUNT would increment every second for
> popular games. The PLAYED_COUNT would increment everytime someone
> played, but the fact that we have "unique" into the equation means
> that I had to break this away into a log table, so that I could GROUP
> BY ip address and then update the columns accordingly in the GAME
> table.  So I had:
>
>
>    GAME_PLAYED_LOG Table
>    ----------------------
>    GAME_ID
>    PLAYER_IP_ADDR
>
>
>  From here, I would select COUNT(PLAYER_IP_ADDR) and COUNT(DISTINCT
> PLAYER_IP_ADDR), grouped by GAME_ID.  I would then update the main
> GAME table's columns with these values. This process was done every
> hour to make sure we didn't have a huge number of rows to manipulate.
>
> Anyway, now we're realizing that we are becoming increasingly popular
> and our tables are large enough for us to require optimization.
> Because GAME table is mostly for static information, and is also
> required to be highly available because of our online interface, I
> have just taken away the VIEWS_COUNT and PLAYED_COUNT into another
> table.
>
>
>    GAME Table (~5 million rows)
>    ----------------------------
>    GAME_ID
>    PRODUCER_ID
>
>
>    GAME_COUNTS Table (also ~5 million rows of course)
>    ---------------------------------------------------
>    GAME_ID
>    VIEWS_COUNT
>    PLAYED_COUNT
>    PLAYED_COUNT_UNIQUE
>
>
> This way, from our GAME_PLAYED_LOG, we do not need to update the main
> GAME table every hour, but only the GAME_COUNTS. This leaves the GAME
> table free to do its job. That is our main table, so keeping it static
> was our priority.

And good database theory, too, since the number of times that a game
is played is not "tied" to the PRODUCER_ID.

<golf clap>

> My problems:
>
> 1. Because we are a popular website, the GAME_PLAYED_LOG table grows
> at massive rates every hour. About 1,000,000 records. Doing COUNT
> queries on this table already is pretty resource hogging, even if we
> do them every hour. Especially the DISTINCT grouping to get unique
> played count.

Any index on GAME_PLAYED_LOG?

> 1.5. After we update the GAME_COUNTS table, we also truncate the
> GAME_PLAYED_LOG table because its records have no significance
> anymore. This hourly deletion leads to fragmentation etc in the table.
>
> 2. The UPDATE queries to GAME_COUNTS table are also a problem. The
> sheer number of UPDATE queries is massive. Of the 1,000,000 views
> every hour, many are just one or two views of many, many games
> (remember, we have 5,000,000 games). So every hour we end up running
> thousands of small UPDATE queries like:
>
>   update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3, PLAYED_COUNT + 1...
>
> 3. Finally, the JOIN queries between the GAME table and GAME_COUNTS
> table are not very fast. They take about 1 second each, even if I do a
> LIMIT 20 in every query.
>
>
> Now I suspect this scenario is *very* common in online databases that
> must report viewer statistics. How are you guys doing it? What am I
> missing? Isn't the decoupling of the static information from the more
> frequently updated information a good idea? Would using a different
> storage engine for different kinds of tables help -- engines that were
> better for INSERT and SELECT, while others that were good for UPDATE?
> We ran MySQL until a year ago but we have switched to Pg since we had
> data corruption issues. I am sure I'm missing some tricks in the Pg
> world, and would truly appreciate any ideas.
>
> If you are reading until here, well done! And so many thanks in
> advance for any insight you can shed into this matter.

What index(es) is/are on GAME and GAME_COUNTS?

What version of PG are you running?

Are you pushing the box too hard?

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: Database design wisdom needed

From
Andrew Sullivan
Date:
On Tue, Jun 05, 2007 at 01:46:42PM +0800, Erick Papadakis wrote:
>
>    GAME_COUNTS Table (also ~5 million rows of course)
>    ---------------------------------------------------
>    GAME_ID
>    VIEWS_COUNT
>    PLAYED_COUNT
>    PLAYED_COUNT_UNIQUE

This is a poor normalisation.  While views_count is necessarily a
superset of played_count, the two values are not functions of one
another really (for instance, there will be games that have been
viewed but never played); therefore they should be in separate
tables, I'd say.  I'd also like to point out that a 5 million row
table is actually not very big.

In order to make this fast, I'd probably implement a dirtyish hack
along the following lines.

create table game_views_summ (
    game_id int references games(id),
    views bigint,
    counted_at timestamp with time zone);

create table game_views (
    game_id int references games(id),
    viewed_at timestamp with time zone);

Then, you have a daemon that summarizes data in game_views regularly
into game_views_summ and deletes the data in game_views that's
just been updated.  If you do this more or less all the time, you can
keep the vacuums up to date and keep the bloat to a minimum.  It's
an increase in overall I/O, but it ought to be smoother than just
trying to cope with it in big lumps.  (A similar strategy will work
for the play counts.)

One thing to be careful about is that you don't end up with
game_views with very few rows, but a huge number of dead rows.  This
two-table approach can rapidly become a pessimal implementation in
the event you are too efficient at eliminating the
rows-to-be-summarized, but have a lot of dead rows that are
unrecoverable because of running transactions.  You'll end up with a
seqscan on a table that contains four live rows, except that it's
400M.  That pattern is a killer.


> 1.5. After we update the GAME_COUNTS table, we also truncate the
> GAME_PLAYED_LOG table because its records have no significance
> anymore. This hourly deletion leads to fragmentation etc in the table.

Are you truncating, or deleting?  TRUNCATE leaves no dead rows,
fragmentation, &c.  That's its point.

> (remember, we have 5,000,000 games). So every hour we end up running
> thousands of small UPDATE queries like:
>
>   update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3, PLAYED_COUNT + 1...

Why is this bad?  (You don't actually need thousands of these, I
think, because you ought to be able to design one query to do it all.
But I'd be unhappy with the locking, I think, given what you're
trying to do.)

> 3. Finally, the JOIN queries between the GAME table and GAME_COUNTS
> table are not very fast. They take about 1 second each, even if I do a
> LIMIT 20 in every query.

This sounds like something's wrong in your query or your plan.
EXPLAIN ANALYSE is your friend here.  I'd suspect vacuum issues.

Oh, one other thing.  I noted you're storing the player's IP address.
You do know that maps very poorly to actual individuals on the other
end, right?

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful
than reformatting your hard drive when it gets out of whack.
        --Scott Morris