Re: Data warehousing requirements - Mailing list pgsql-performance

From Aaron Werman
Subject Re: Data warehousing requirements
Date
Msg-id BAY18-DAV8suot0OUIH00003354@hotmail.com
Whole thread Raw
In response to Data warehousing requirements  (Gabriele Bartolini <angusgb@tin.it>)
Responses Re: Data warehousing requirements
List pgsql-performance
Consider how the fact table is going to be used, and review hacking it up
based on usage. Fact tables should be fairly narrow, so if there are extra
columns beyond keys and dimension keys consider breaking it into parallel
tables (vertical partitioning).

Horizontal partitioning is your friend; especially if it is large - consider
slicing the data into chunks. If the fact table is date driven it might be
worthwhile to break it into separate tables based on date key.  This wins in
reducing the working set of queries and in buffering. If there is a real
hotspot, such as current month's activity, you might want to keep a separate
table with just the (most) active data.Static tables of unchanged data can
simplify backups, etc., as well.

Consider summary tables if you know what type of queries you'll hit.
Especially here, MVCC is not your friend because it has extra work to do for
aggregate functions.

Cluster helps if you bulk load.

In most warehouses, the data is downstream data from existing operational
systems. Because of that you're not able to use database features to
preserve integrity. In most cases, the data goes through an
extract/transform/load process - and the output is considered acceptable.
So, no RI is correct for star or snowflake design. Pretty much no anything
else that adds intelligence - no triggers, no objects, no constraints of any
sort. Many designers try hard to avoid nulls.

On the hardware side - RAID5 might work here because of the low volume if
you can pay the write performance penalty. To size hardware you need to
estimate load in terms of transaction type (I usually make bucket categories
of small, medium, and large effort needs) and transaction rate. Then try to
estimate how much CPU and I/O they'll use.

/Aaron

"Let us not speak of them; but look, and pass on."

----- Original Message -----
From: "Gabriele Bartolini" <angusgb@tin.it>
To: <pgsql-performance@postgresql.org>
Sent: Wednesday, October 06, 2004 5:36 PM
Subject: [PERFORM] Data warehousing requirements


> Hi guys,
>
>     I just discussed about my problem on IRC. I am building a Web usage
> mining system based on Linux, PostgreSQL and C++ made up of an OLTP
> database which feeds several and multi-purpose data warehouses about
users'
> behaviour on HTTP servers.
>
>     I modelled every warehouse using the star schema, with a fact table
and
> then 'n' dimension tables linked using a surrogate ID.
>
>     Discussing with the guys of the chat, I came up with these
conclusions,
> regarding the warehouse's performance:
>
> 1) don't use referential integrity in the facts table
> 2) use INTEGER and avoid SMALLINT and NUMERIC types for dimensions' IDs
> 3) use an index for every dimension's ID in the fact table
>
>     As far as administration is concerned: run VACUUM ANALYSE daily and
> VACUUM FULL periodically.
>
>     Is there anything else I should keep in mind?
>
>     Also, I was looking for advice regarding hardware requirements for a
> data warehouse system that needs to satisfy online queries. I have indeed
> no idea at the moment. I can only predict 4 million about records a month
> in the fact table, does it make sense or not? is it too much?
>
>     Data needs to be easily backed up and eventually replicated.
>
>     Having this in mind, what hardware architecture should I look for? How
> many hard disks do I need, what kind and what RAID solution do you suggest
> me to adopt (5 or 10 - I think)?
>
> Thank you so much,
> -Gabriele
> --
> Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
> maintainer
> Current Location: Prato, Toscana, Italia
> angusgb@tin.it | http://www.prato.linux.it/~gbartolini | ICQ#129221447
>  > "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
> Inferno
>


----------------------------------------------------------------------------
----


>
> ---
> Outgoing mail is certified Virus Free.
> Checked by AVG anti-virus system (http://www.grisoft.com).
> Version: 6.0.773 / Virus Database: 520 - Release Date: 05/10/2004
>


----------------------------------------------------------------------------
----


>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>

pgsql-performance by date:

Previous
From: Tatsuo Ishii
Date:
Subject: Re: Caching of Queries
Next
From: Pierre-Frédéric Caillaud
Date:
Subject: Re: sequential scan on select distinct