Database design wisdom needed - Mailing list pgsql-general

From Erick Papadakis
Subject Database design wisdom needed
Date
Msg-id e9e8f77d0706042246m77ffe600jf9577511f52f7e79@mail.gmail.com
Whole thread Raw
Responses Re: Database design wisdom needed  (Ron Johnson <ron.l.johnson@cox.net>)
Re: Database design wisdom needed  (Andrew Sullivan <ajs@crankycanuck.ca>)
List pgsql-general
Hi

Sorry for this somewhat long email but I think it is relevant to most
people who run online databases. I am having trouble optimizing UPDATE
queries on a certain semi-large table that is only growing larger.
I've come across some very interesting thoughts from this list, so I
thought I'll post my conundrum here. Would truly appreciate any
guidance or pointers.

I have done the messy DBA's job of breaking one table into three,
which should actually have been one. Just did it to make sure the
large main table doesn't keep getting updated with small crumbs of
information.

This main table stores what users of our online service have created.
These are small Flash games, but over the last seven years, we have
collected about 5 million of these. The primary key of this table is
"GAME_ID". In our online account management interface I also need to
show each producer's creations only to himself, so of course I need a
"PRODUCER_ID".

Above that, in our service we have had three more bits of information
for each created game:

- Number of views for the game (online views)
- Number of plays of the game (online people playing)
- Number of unique plays of the game (based on ip or
subscriber_id..some games do not require login to play)

Because each of these pieces of information was individual to each
game, my original table looked like this:


    GAME Table
    -----------
    GAME_ID
    PRODUCER_ID
    VIEWS_COUNT
    PLAYED_COUNT
    PLAYED_COUNT_UNIQUE


In the beginning, everytime someone played a game online, we updated
the PLAYED_COUNT in this table. When someone viewed a game, we updated
the VIEWS_COUNT. It was simple and it worked as it reflected the
business logic.

Over time, however, I realized that while GAME_ID and PRODUCER_ID
remained basically static, the next three columns had a very high rate
of change. For example, VIEWS_COUNT would increment every second for
popular games. The PLAYED_COUNT would increment everytime someone
played, but the fact that we have "unique" into the equation means
that I had to break this away into a log table, so that I could GROUP
BY ip address and then update the columns accordingly in the GAME
table.  So I had:


    GAME_PLAYED_LOG Table
    ----------------------
    GAME_ID
    PLAYER_IP_ADDR


From here, I would select COUNT(PLAYER_IP_ADDR) and COUNT(DISTINCT
PLAYER_IP_ADDR), grouped by GAME_ID.  I would then update the main
GAME table's columns with these values. This process was done every
hour to make sure we didn't have a huge number of rows to manipulate.

Anyway, now we're realizing that we are becoming increasingly popular
and our tables are large enough for us to require optimization.
Because GAME table is mostly for static information, and is also
required to be highly available because of our online interface, I
have just taken away the VIEWS_COUNT and PLAYED_COUNT into another
table.


    GAME Table (~5 million rows)
    ----------------------------
    GAME_ID
    PRODUCER_ID


    GAME_COUNTS Table (also ~5 million rows of course)
    ---------------------------------------------------
    GAME_ID
    VIEWS_COUNT
    PLAYED_COUNT
    PLAYED_COUNT_UNIQUE


This way, from our GAME_PLAYED_LOG, we do not need to update the main
GAME table every hour, but only the GAME_COUNTS. This leaves the GAME
table free to do its job. That is our main table, so keeping it static
was our priority.

My problems:

1. Because we are a popular website, the GAME_PLAYED_LOG table grows
at massive rates every hour. About 1,000,000 records. Doing COUNT
queries on this table already is pretty resource hogging, even if we
do them every hour. Especially the DISTINCT grouping to get unique
played count.

1.5. After we update the GAME_COUNTS table, we also truncate the
GAME_PLAYED_LOG table because its records have no significance
anymore. This hourly deletion leads to fragmentation etc in the table.

2. The UPDATE queries to GAME_COUNTS table are also a problem. The
sheer number of UPDATE queries is massive. Of the 1,000,000 views
every hour, many are just one or two views of many, many games
(remember, we have 5,000,000 games). So every hour we end up running
thousands of small UPDATE queries like:

   update GAME_COUNTS set VIEWS_COUNT = VIEWS_COUNT + 3, PLAYED_COUNT + 1...

3. Finally, the JOIN queries between the GAME table and GAME_COUNTS
table are not very fast. They take about 1 second each, even if I do a
LIMIT 20 in every query.


Now I suspect this scenario is *very* common in online databases that
must report viewer statistics. How are you guys doing it? What am I
missing? Isn't the decoupling of the static information from the more
frequently updated information a good idea? Would using a different
storage engine for different kinds of tables help -- engines that were
better for INSERT and SELECT, while others that were good for UPDATE?
We ran MySQL until a year ago but we have switched to Pg since we had
data corruption issues. I am sure I'm missing some tricks in the Pg
world, and would truly appreciate any ideas.

If you are reading until here, well done! And so many thanks in
advance for any insight you can shed into this matter.

.EP

pgsql-general by date:

Previous
From: Martijn van Oosterhout
Date:
Subject: Re: $libdir
Next
From: Gregory Stark
Date:
Subject: Re: There can be only one! How to avoid the "highlander-problem".