Thread: Normalization or Performance

Normalization or Performance

From

Alvaro Nunes Melo

Date:

02 December 2004, 17:06:17

Hi,

Before writing this mail, I'd researched a little about this topic, and
got some opinions from guys like Fabien Pascal, who argues that logical
design should be separated from physical design, and other sources. As
this is not fact, I'm writing to you guys, that make things work in real
world.

We started our first big (for our company standards) project always
thinking in normalization. But once we imported legacy data into the DB,
things got harder.

One example is the clients status. A client might be active, inactive or
pending (for many reasons). We store all the status a client have since
it is in the system. To check what is the actual status of a client, we
get the last status from this historical status table. This take a
considerable time, so our best results were achieved building  a
function that checks the status and indexing this function. The problem
is that indexed functions mus bu immutable, so as you can figure, if the
status change after the creation of the index, the retunr of the
function is still the same.

What do you suggest for situations like this? Should I add a field to
clients table and store its actual status, keep storing data in the
historical table an control its changes with a trigger?

There are other situations that are making things difficult to us. For
example, one query must return the total amount a client bought in the
last 90 days. It's taking too long, when we must know it for many
clients, many times. So should I create summarization tables to store
this kind of stuff, update it with a trigger in daily basis (for
example), and solve this problem with one join?

Our database is not that big. The larger table has about 7.000.000 rows.
About 50.000 clients, half of them active. All that I'd point out above
uses indexes for queries, etc. But even with this it's not been fast
enough. We have a Dell server for this (I know, the Dell issue), a Dual
Xeon 2.8, SCSI HD, 1 GB mem. Do we need better hardware for our system?

--
+---------------------------------------------------+
|  Alvaro Nunes Melo    Atua Sistemas de Informacao |
| al_nunes@atua.com.br        www.atua.com.br       |
|    UIN - 42722678            (54) 327-1044        |
+---------------------------------------------------+

Re: Normalization or Performance

From

Richard Huxton

Date:

02 December 2004, 18:35:14

Alvaro Nunes Melo wrote:
> Hi,
>
> Before writing this mail, I'd researched a little about this topic,
> and got some opinions from guys like Fabien Pascal, who argues that
> logical design should be separated from physical design, and other
> sources. As this is not fact, I'm writing to you guys, that make
> things work in real world.

I believe he's right. Or at least that you should only compromise your
logical design once it becomes absolutely necessary due to physical
limitations.

> We started our first big (for our company standards) project always
> thinking in normalization. But once we imported legacy data into the
> DB, things got harder.
>
> One example is the clients status. A client might be active, inactive
> or pending (for many reasons). We store all the status a client have
> since it is in the system. To check what is the actual status of a
> client, we get the last status from this historical status table.
> This take a considerable time, so our best results were achieved
> building  a function that checks the status and indexing this
> function. The problem is that indexed functions mus bu immutable, so
> as you can figure, if the status change after the creation of the
> index, the retunr of the function is still the same.
>
> What do you suggest for situations like this? Should I add a field to
>  clients table and store its actual status, keep storing data in the
> historical table an control its changes with a trigger?

Trigger + history table is a common solution, it's easy to implement and
there's nothing non-relational about it as a solution.

> There are other situations that are making things difficult to us.
> For example, one query must return the total amount a client bought
> in the last 90 days. It's taking too long, when we must know it for
> many clients, many times. So should I create summarization tables to
> store this kind of stuff, update it with a trigger in daily basis
> (for example), and solve this problem with one join?

One solution I use for this sort of thing is a summary table grouped by
date, and accurate until the start of today. Then, I check the summary
table and the "live" table for todays information and sum those.

> Our database is not that big. The larger table has about 7.000.000
> rows. About 50.000 clients, half of them active. All that I'd point
> out above uses indexes for queries, etc. But even with this it's not
> been fast enough. We have a Dell server for this (I know, the Dell
> issue), a Dual Xeon 2.8, SCSI HD, 1 GB mem. Do we need better
> hardware for our system?

Swap one of your processors for more RAM and disks, perhaps.

--
   Richard Huxton
   Archonet Ltd

Re: Normalization or Performance

From

"Jim C. Nasby"

Date:

02 December 2004, 23:04:35

On Thu, Dec 02, 2004 at 03:05:55PM -0200, Alvaro Nunes Melo wrote:
> Hi,
>
> Before writing this mail, I'd researched a little about this topic, and
> got some opinions from guys like Fabien Pascal, who argues that logical
> design should be separated from physical design, and other sources. As
> this is not fact, I'm writing to you guys, that make things work in real
> world.
>
> We started our first big (for our company standards) project always
> thinking in normalization. But once we imported legacy data into the DB,
> things got harder.
>
> One example is the clients status. A client might be active, inactive or
> pending (for many reasons). We store all the status a client have since
> it is in the system. To check what is the actual status of a client, we
> get the last status from this historical status table. This take a
> considerable time, so our best results were achieved building  a
> function that checks the status and indexing this function. The problem
> is that indexed functions mus bu immutable, so as you can figure, if the
> status change after the creation of the index, the retunr of the
> function is still the same.
>
> What do you suggest for situations like this? Should I add a field to
> clients table and store its actual status, keep storing data in the
> historical table an control its changes with a trigger?

It seems you shouldn't have to resort to this. SELECT status FROM
client_status WHERE client_id = blah ORDER BY status_date DESC LIMIT 1
should be pretty fast given an index on client_id, status_date (which
should be able to be unique).

> There are other situations that are making things difficult to us. For
> example, one query must return the total amount a client bought in the
> last 90 days. It's taking too long, when we must know it for many
> clients, many times. So should I create summarization tables to store
> this kind of stuff, update it with a trigger in daily basis (for
> example), and solve this problem with one join?

This sounds like a more likely candidate for a summary table, though you
might not want to use a trigger. Unless you need absolutely up-to-date
information it seems like a nightly process to update the totals would
be better and more efficient.

> Our database is not that big. The larger table has about 7.000.000 rows.
> About 50.000 clients, half of them active. All that I'd point out above
> uses indexes for queries, etc. But even with this it's not been fast
> enough. We have a Dell server for this (I know, the Dell issue), a Dual
> Xeon 2.8, SCSI HD, 1 GB mem. Do we need better hardware for our system?

Is all this on a single HD? That's going to be a huge bottleneck. You'll
be much better off with a mirrored partition for your WAL files and
either raid5 or raid10 for the database itself. You'd probably be better
off with more memory as well. If you're going to buy a new box instead
of upgrade your existing one, I'd recommend going with an Opteron
because of it's much better memory bandwidth.

For reference, stats.distributed.net is a dual Opteron 244 1.8GHz with
4G ram, a 200G mirror for WAL and the system files and a 6x200G RAID10
for the database (all SATA drives). The largest table 120M rows and
825,000 8k pages. I can scan 1/5th of that table via an index scan in
about a minute. (The schema can be found at
http://minilink.org/cvs.distributed.net/l3.sql.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Normalization or Performance

From

"Iain"

Date:

03 December 2004, 01:43:21

Hi,

without knowing much about your system, it seems to me that the current
status of a client should be represented by a status code on the client
record. History is the list of *past* status codes. The full history,
including the current status of a client would be obtained using a union.

I had a situation which might have some parallels with yours. nthis case the
tables represented orders and receivals. The status of an order item was
held on a code in the receival itemss table (an order item can be received
many times). This is seems to me to be not normalized as the status was
actually the status of the order item, not the receival. The receival just
caused the status of the order item to change. This arrangement required
ridiculously complex sql and resulted in poor performance. Moving the status
code to the order item and implementing a simple trigger on the receival
items table cleaned things up significantly.

To put it simply, if the current status of an order item is a simple
attribute of the order item, then it should be in the order item table. The
same might be said for your client.

This is just my personal opinion though and I'm always open to alternative
opinions, as I think you are.

regards
Iain
----- Original Message -----
From: "Alvaro Nunes Melo" <al_nunes@atua.com.br>
To: "Pgsql-Performance" <pgsql-performance@postgresql.org>
Sent: Friday, December 03, 2004 2:05 AM
Subject: [PERFORM] Normalization or Performance


> Hi,
>
> Before writing this mail, I'd researched a little about this topic, and
> got some opinions from guys like Fabien Pascal, who argues that logical
> design should be separated from physical design, and other sources. As
> this is not fact, I'm writing to you guys, that make things work in real
> world.
>
> We started our first big (for our company standards) project always
> thinking in normalization. But once we imported legacy data into the DB,
> things got harder.
>
> One example is the clients status. A client might be active, inactive or
> pending (for many reasons). We store all the status a client have since
> it is in the system. To check what is the actual status of a client, we
> get the last status from this historical status table. This take a
> considerable time, so our best results were achieved building  a
> function that checks the status and indexing this function. The problem
> is that indexed functions mus bu immutable, so as you can figure, if the
> status change after the creation of the index, the retunr of the
> function is still the same.
>
> What do you suggest for situations like this? Should I add a field to
> clients table and store its actual status, keep storing data in the
> historical table an control its changes with a trigger?
>
> There are other situations that are making things difficult to us. For
> example, one query must return the total amount a client bought in the
> last 90 days. It's taking too long, when we must know it for many
> clients, many times. So should I create summarization tables to store
> this kind of stuff, update it with a trigger in daily basis (for
> example), and solve this problem with one join?
>
> Our database is not that big. The larger table has about 7.000.000 rows.
> About 50.000 clients, half of them active. All that I'd point out above
> uses indexes for queries, etc. But even with this it's not been fast
> enough. We have a Dell server for this (I know, the Dell issue), a Dual
> Xeon 2.8, SCSI HD, 1 GB mem. Do we need better hardware for our system?
>
> --
> +---------------------------------------------------+
> |  Alvaro Nunes Melo    Atua Sistemas de Informacao |
> | al_nunes@atua.com.br        www.atua.com.br       |
> |    UIN - 42722678            (54) 327-1044        |
> +---------------------------------------------------+
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend