Home > mailing lists

Re: Data warehousing requirements - Mailing list pgsql-performance

From	Gabriele Bartolini
Subject	Re: Data warehousing requirements
Date	October 7, 2004 21:11:26
Msg-id	6.1.2.0.2.20041007183909.0201f310@box.tin.it Whole thread Raw
In response to	Re: Data warehousing requirements ("Aaron Werman" <awerman2@hotmail.com>)
Responses	Re: Data warehousing requirements
List	pgsql-performance

Tree view

At 13.30 07/10/2004, Aaron Werman wrote:
>Consider how the fact table is going to be used, and review hacking it up
>based on usage. Fact tables should be fairly narrow, so if there are extra
>columns beyond keys and dimension keys consider breaking it into parallel
>tables (vertical partitioning).

Hmm ... I have only an extra column. Sorry if I ask you to confirm this,
but practically vertical partitioning allows me to divide a table into 2
tables (like if I cut them vertically, right?) having the same key. If I
had 2 extra columns, that could be the case, couldn't it?

>Horizontal partitioning is your friend; especially if it is large - consider
>slicing the data into chunks. If the fact table is date driven it might be
>worthwhile to break it into separate tables based on date key.  This wins in
>reducing the working set of queries and in buffering. If there is a real
>hotspot, such as current month's activity, you might want to keep a separate
>table with just the (most) active data.Static tables of unchanged data can
>simplify backups, etc., as well.

In this case, you mean I can chunk data into: "facts_04_08" for the august
2004 facts. Is this the case?

Otherwise, is it right my point of view that I can get good results by
using a different approach, based on mixing vertical partitioning and the
CLUSTER facility of PostgreSQL? Can I vertically partition also dimension
keys from the fact table or not?

However, this subject is awesome and interesting. Far out ... data
warehousing seems to be really continous modeling, doesn't it! :-)

>Consider summary tables if you know what type of queries you'll hit.

At this stage, I can't predict it yet. But of course I need some sort of
summary. I will keep it in mind.

>Especially here, MVCC is not your friend because it has extra work to do for
>aggregate functions.

Why does it have extra work? Do you mind being more precise, Aaron? It is
really interesting. (thanks)

>Cluster helps if you bulk load.

Is it maybe because I can update or build them once the load operation has
finished?

>In most warehouses, the data is downstream data from existing operational
>systems.

That's my case too.

>Because of that you're not able to use database features to
>preserve integrity. In most cases, the data goes through an
>extract/transform/load process - and the output is considered acceptable.
>So, no RI is correct for star or snowflake design. Pretty much no anything
>else that adds intelligence - no triggers, no objects, no constraints of any
>sort. Many designers try hard to avoid nulls.

That's another interesting argument. Again, I had in mind the space
efficiency principle and I decided to use null IDs for dimension tables if
I don't have the information. I noticed though that in those cases I can't
use any index and performances result very poor.

I have a dimension table 'categories' referenced through the 'id_category'
field in the facts table. I decided to set it to NULL in case I don't have
any category to associate to it. I believe it is better to set a '0' value
if I don't have any category, allowing me not to use a "SELECT * from facts
where id_category IS NULL" which does not use the INDEX I had previously
created on that field.

>On the hardware side - RAID5 might work here because of the low volume if
>you can pay the write performance penalty. To size hardware you need to
>estimate load in terms of transaction type (I usually make bucket categories
>of small, medium, and large effort needs) and transaction rate. Then try to
>estimate how much CPU and I/O they'll use.

Thank you so much again Aaron. Your contribution has been really important
to me.

Ciao,
-Gabriele

>"Let us not speak of them; but look, and pass on."

P.S.: Dante rules ... :-)

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Prato, Toscana, Italia
angusgb@tin.it | http://www.prato.linux.it/~gbartolini | ICQ#129221447
 > "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.773 / Virus Database: 520 - Release Date: 05/10/2004

pgsql-performance by date:

From: Greg Stark
Date: 07 October 2004, 21:08:21
Subject: Re: sequential scan on select distinct

From: Ole Langbehn
Date: 07 October 2004, 21:19:32
Subject: Re: sequential scan on select distinct

Re: Data warehousing requirements - Mailing list pgsql-performance

Previous

Next