Thread: Data warehousing requirements

Data warehousing requirements

From: Gabriele Bartolini
Date:
Hi guys,

    I have just been discussing my problem on IRC. I am building a Web usage
mining system based on Linux, PostgreSQL and C++, made up of an OLTP
database which feeds several multi-purpose data warehouses describing users'
behaviour on HTTP servers.

    I modelled every warehouse using a star schema, with a fact table and
'n' dimension tables linked to it through surrogate IDs.

    From the discussion on the channel, I came to the following conclusions
regarding the warehouse's performance:

1) don't use referential integrity constraints on the fact table
2) use INTEGER, and avoid SMALLINT and NUMERIC, for dimension IDs
3) create an index on every dimension ID column in the fact table

    As far as administration is concerned: run VACUUM ANALYSE daily and
VACUUM FULL periodically.
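
    To make this concrete, here is a minimal sketch of the kind of schema I
have in mind (table and column names are just invented for the example):

-- dimension tables with INTEGER surrogate keys
CREATE TABLE dim_page (
    id_page   INTEGER PRIMARY KEY,
    url       TEXT NOT NULL
);

CREATE TABLE dim_date (
    id_date   INTEGER PRIMARY KEY,
    day       DATE NOT NULL
);

-- fact table: surrogate dimension keys plus measures, with no REFERENCES
-- clauses (point 1), INTEGER keys (point 2) and one index per key (point 3)
CREATE TABLE facts (
    id_page   INTEGER NOT NULL,
    id_date   INTEGER NOT NULL,
    hits      INTEGER NOT NULL
);

CREATE INDEX facts_id_page_idx ON facts (id_page);
CREATE INDEX facts_id_date_idx ON facts (id_date);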

    Is there anything else I should keep in mind?

    Also, I am looking for advice regarding hardware requirements for a
data warehouse system that needs to satisfy online queries. I really have
no idea at the moment. I can only predict about 4 million records a month
in the fact table. Does that sound reasonable, or is it too much?

    Data needs to be easily backed up and eventually replicated.

    With this in mind, what hardware architecture should I look for? How
many hard disks do I need, of what kind, and what RAID level would you
suggest I adopt (RAID 5 or RAID 10, I think)?

Thank you so much,
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Prato, Toscana, Italia
angusgb@tin.it | http://www.prato.linux.it/~gbartolini | ICQ#129221447
 > "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno


Re: Data warehousing requirements

From: "Aaron Werman"
Date:
Consider how the fact table is going to be used, and review hacking it up
based on usage. Fact tables should be fairly narrow, so if there are extra
columns beyond keys and dimension keys consider breaking it into parallel
tables (vertical partitioning).

Horizontal partitioning is your friend; especially if it is large - consider
slicing the data into chunks. If the fact table is date driven it might be
worthwhile to break it into separate tables based on date key.  This wins in
reducing the working set of queries and in buffering. If there is a real
hotspot, such as current month's activity, you might want to keep a separate
table with just the (most) active data. Static tables of unchanged data can
simplify backups, etc., as well.
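
A rough sketch of what I mean (table and column names are made up) - one
static table per closed month plus a small "hot" table for the current month:

-- closed months: loaded once, then effectively read-only
CREATE TABLE facts_2004_08 (id_page INTEGER, id_date INTEGER, hits INTEGER);
CREATE TABLE facts_2004_09 (id_page INTEGER, id_date INTEGER, hits INTEGER);

-- hot table holding only the current month's activity
CREATE TABLE facts_current (id_page INTEGER, id_date INTEGER, hits INTEGER);

-- reports on a closed month touch only that one small, static table
SELECT id_page, sum(hits) FROM facts_2004_08 GROUP BY id_page;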

Consider summary tables if you know what type of queries you'll hit.
Especially here, MVCC is not your friend because it has extra work to do for
aggregate functions.
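
For instance (purely illustrative), a summary table rebuilt after each load
lets online reports read a handful of pre-aggregated rows instead of
re-scanning the fact table:

-- rebuilt after every load; slightly stale between loads is usually fine
CREATE TABLE facts_daily_summary AS
    SELECT id_date, id_page, count(*) AS requests, sum(hits) AS total_hits
    FROM facts
    GROUP BY id_date, id_page;

-- after the next load:
TRUNCATE facts_daily_summary;
INSERT INTO facts_daily_summary
    SELECT id_date, id_page, count(*), sum(hits)
    FROM facts
    GROUP BY id_date, id_page;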

Cluster helps if you bulk load.
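
For example (the index name is hypothetical), after a monthly bulk load:

-- physically re-order the fact table along the date key, then refresh stats
CLUSTER facts_id_date_idx ON facts;
ANALYZE facts;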

In most warehouses, the data is downstream data from existing operational
systems. Because of that you're not able to use database features to
preserve integrity. In most cases, the data goes through an
extract/transform/load process - and the output is considered acceptable.
So, no RI is correct for star or snowflake design. Pretty much no anything
else that adds intelligence - no triggers, no objects, no constraints of any
sort. Many designers try hard to avoid nulls.

On the hardware side - RAID5 might work here because of the low volume if
you can pay the write performance penalty. To size hardware you need to
estimate load in terms of transaction type (I usually make bucket categories
of small, medium, and large effort needs) and transaction rate. Then try to
estimate how much CPU and I/O they'll use.
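
As a back-of-the-envelope example: if a fact row averages, say, 100 bytes on
disk (an assumption - measure your real rows), 4 million rows a month is
roughly 400 MB a month, or about 5 GB a year before indexes. That is a modest
volume for current hardware, which is why RAID5 may be acceptable here.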

/Aaron

"Let us not speak of them; but look, and pass on."


Re: Data warehousing requirements

From: Gabriele Bartolini
Date:
At 13.30 07/10/2004, Aaron Werman wrote:
>Consider how the fact table is going to be used, and review hacking it up
>based on usage. Fact tables should be fairly narrow, so if there are extra
>columns beyond keys and dimension keys consider breaking it into parallel
>tables (vertical partitioning).

Hmm ... I have only one extra column. Sorry to ask you to confirm this, but
basically vertical partitioning allows me to divide a table into 2 tables
(as if I cut it vertically, right?) that share the same key. If I had 2
extra columns, that could be the case, couldn't it?

>Horizontal partitioning is your friend; especially if it is large - consider
>slicing the data into chunks. If the fact table is date driven it might be
>worthwhile to break it into separate tables based on date key.  This wins in
>reducing the working set of queries and in buffering. If there is a real
>hotspot, such as current month's activity, you might want to keep a separate
>table with just the (most) active data. Static tables of unchanged data can
>simplify backups, etc., as well.

In this case, you mean I can chunk the data into tables like "facts_04_08"
for the August 2004 facts. Is that the idea?

Otherwise, am I right in thinking that I can get good results with a
different approach, based on mixing vertical partitioning and the CLUSTER
facility of PostgreSQL? Can I also vertically partition dimension keys out
of the fact table, or not?

However, this subject is awesome and interesting. Far out ... data
warehousing seems to be really continuous modeling, doesn't it! :-)

>Consider summary tables if you know what type of queries you'll hit.

At this stage, I can't predict it yet. But of course I need some sort of
summary. I will keep it in mind.

>Especially here, MVCC is not your friend because it has extra work to do for
>aggregate functions.

Why does it have extra work? Do you mind being more precise, Aaron? It is
really interesting. (thanks)

>Cluster helps if you bulk load.

Is it maybe because I can update or build them once the load operation has
finished?

>In most warehouses, the data is downstream data from existing operational
>systems.

That's my case too.

>Because of that you're not able to use database features to
>preserve integrity. In most cases, the data goes through an
>extract/transform/load process - and the output is considered acceptable.
>So, no RI is correct for star or snowflake design. Pretty much no anything
>else that adds intelligence - no triggers, no objects, no constraints of any
>sort. Many designers try hard to avoid nulls.

That's another interesting point. Again, I had space efficiency in mind,
and I decided to use NULL dimension IDs when I don't have the information.
I noticed, though, that in those cases I can't use any index and
performance turns out to be very poor.

I have a dimension table 'categories', referenced through the 'id_category'
field in the fact table. I had decided to set it to NULL whenever there is
no category to associate with the fact. I now believe it is better to store
a '0' value when there is no category, so that I can avoid queries like
"SELECT * FROM facts WHERE id_category IS NULL", which do not use the index
I had previously created on that field.
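
Something like this is what I have in mind now (apart from 'categories' and
'id_category', the names are invented):

-- explicit "unknown" member of the dimension, with surrogate key 0
INSERT INTO categories (id_category, name) VALUES (0, 'unknown');

-- facts with no real category point at the placeholder row, so the column
-- stays NOT NULL and an equality predicate can use the index
SELECT * FROM facts WHERE id_category = 0;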

>On the hardware side - RAID5 might work here because of the low volume if
>you can pay the write performance penalty. To size hardware you need to
>estimate load in terms of transaction type (I usually make bucket categories
>of small, medium, and large effort needs) and transaction rate. Then try to
>estimate how much CPU and I/O they'll use.

Thank you so much again Aaron. Your contribution has been really important
to me.

Ciao,
-Gabriele

>"Let us not speak of them; but look, and pass on."

P.S.: Dante rules ... :-)

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Prato, Toscana, Italia
angusgb@tin.it | http://www.prato.linux.it/~gbartolini | ICQ#129221447
 > "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno


Re: Data warehousing requirements

From: Josh Berkus
Date:
Gabriele,

> That's another interesting argument. Again, I had in mind the space
> efficiency principle and I decided to use null IDs for dimension tables if
> I don't have the information. I noticed though that in those cases I can't
> use any index and performances result very poor.

For one thing, this is false optimization; a NULL isn't saving you any table
size on an INT or BIGINT column.    NULLs are only smaller on variable-width
columns.  If you're going to start counting bytes, make sure it's an informed
count.

More importantly, you should never, ever allow null FKs on a star-topology
database.    LEFT OUTER JOINs are vastly less efficient than INNER JOINs in a
query, and the difference between having 20 outer joins for your data view,
vs 20 regular joins, can easily be a difference of 100x in execution time.
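
To illustrate with a schematic example (made-up names): with a dummy
"unknown" row in the dimension you can write a plain inner join, whereas a
NULL key forces an outer join on every query that touches that dimension:

-- NULL dimension keys force this:
SELECT c.name, count(*)
FROM facts f LEFT OUTER JOIN categories c ON c.id_category = f.id_category
GROUP BY c.name;

-- with a dummy row (id_category = 0) the same report is a plain inner join,
-- which leaves the planner free to pick the join order:
SELECT c.name, count(*)
FROM facts f JOIN categories c ON c.id_category = f.id_category
GROUP BY c.name;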

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Data warehousing requirements

From: "Aaron Werman"
Date:
----- Original Message -----
From: "Gabriele Bartolini" <angusgb@tin.it>
To: "Aaron Werman" <awerman2@hotmail.com>;
<pgsql-performance@postgresql.org>
Sent: Thursday, October 07, 2004 1:07 PM
Subject: Re: [PERFORM] Data warehousing requirements


> At 13.30 07/10/2004, Aaron Werman wrote:
> >Consider how the fact table is going to be used, and review hacking it up
> >based on usage. Fact tables should be fairly narrow, so if there are extra
> >columns beyond keys and dimension keys consider breaking it into parallel
> >tables (vertical partitioning).
>
> Hmm ... I have only one extra column. Sorry to ask you to confirm this, but
> basically vertical partitioning allows me to divide a table into 2 tables
> (as if I cut it vertically, right?) that share the same key. If I had 2
> extra columns, that could be the case, couldn't it?

Yes - it's splitting a table's columns and copying the PK. If you have only
one column and it's narrow - partitioning becomes harder to justify.

>
> >Horizontal partitioning is your friend; especially if it is large - consider
> >slicing the data into chunks. If the fact table is date driven it might be
> >worthwhile to break it into separate tables based on date key.  This wins in
> >reducing the working set of queries and in buffering. If there is a real
> >hotspot, such as current month's activity, you might want to keep a separate
> >table with just the (most) active data. Static tables of unchanged data can
> >simplify backups, etc., as well.
>
> In this case, you mean I can chunk the data into tables like "facts_04_08"
> for the August 2004 facts. Is that the idea?

Exactly. The problem is when you need to query across the chunks. There was
a discussion here of creating views ala

-- one physical table per month; each branch's WHERE clause describes the
-- date range that chunk covers
create view facts as
 select * from facts_04_07 where datekey between '2004-07-01' and '2004-07-31'
 union all
 select * from facts_04_08 where datekey between '2004-08-01' and '2004-08-31'
 union all
 select * from facts_04_09 where datekey between '2004-09-01' and '2004-09-30'
 ...

hoping the restrictions would help the planner prune chunks out. Has anyone
tried this?

>
> Otherwise, am I right in thinking that I can get good results with a
> different approach, based on mixing vertical partitioning and the CLUSTER
> facility of PostgreSQL? Can I also vertically partition dimension keys out
> of the fact table, or not?

If you can do that, you are probably moving beyond a star schema. The standard
definition of a star schema is a single very large fact table with very
small dimension tables.  The point of a star is that it can be used to
efficiently restrict results by merging the dimensional restrictions and
only extracting matches from the fact table. E.g.,

select
    count(*)
from
    people_fact, /* 270M */
    states_dim, /* only 50 something */
    gender_dim, /* 2 */
    age_dim /* say 115 */
where
    age_dim.age > 65
  and
    gender_dim.gender = 'F'
  and
    states_dim.state_code in ('PR', 'ME')
  and
    age_dim.age_key = people_fact.age_key
  and
    gender_dim.gender_key = people_fact.gender_key
  and
    states_dim.state_key = people_fact.state_key

(I had to write out this trivial query because most DBAs don't realize going
in how ugly star queries are.) If you split the fact table so ages were in a
vertical partition you would optimize queries which didn't use the age data,
but if you needed the age data, you would have to join two large tables -
which is not a star query.

What you're thinking about on the cluster front is fun. You can split groups
of dimension keys off to separate vertical partitions, but you can only
cluster each on a single key. So you need to split each one off, which
results in your inventing the index! (-:

>
> However, this subject is awesome and interesting. Far out ... data
> warehousing seems to be really continuous modeling, doesn't it! :-)
>
> >Consider summary tables if you know what type of queries you'll hit.
>
> At this stage, I can't predict it yet. But of course I need some sort of
> summary. I will keep it in mind.
>
> >Especially here, MVCC is not your friend because it has extra work to do for
> >aggregate functions.
>
> Why does it have extra work? Do you mind being more precise, Aaron? It is
> really interesting. (thanks)

The standard reasons - that a lot of queries that seem intuitively to be
resolvable statically or through indices have to walk the data to find
current versions. Keeping aggregates (especially if you can allow them to be
slightly stale) can reduce lots of reads. A big goal of horizontal
partitioning is to give the planner some way of reducing the query scope.

>
> >Cluster helps if you bulk load.
>
> Is it maybe because I can update or build them once the load operation has
> finished?

If you have streaming loads, clustering can be a pain to implement well.

>
> >In most warehouses, the data is downstream data from existing operational
> >systems.
>
> That's my case too.
>
> >Because of that you're not able to use database features to
> >preserve integrity. In most cases, the data goes through an
> >extract/transform/load process - and the output is considered acceptable.
> >So, no RI is correct for star or snowflake design. Pretty much no anything
> >else that adds intelligence - no triggers, no objects, no constraints of any
> >sort. Many designers try hard to avoid nulls.
>
> That's another interesting point. Again, I had space efficiency in mind,
> and I decided to use NULL dimension IDs when I don't have the information.
> I noticed, though, that in those cases I can't use any index and
> performance turns out to be very poor.
>
> I have a dimension table 'categories', referenced through the 'id_category'
> field in the fact table. I had decided to set it to NULL whenever there is
> no category to associate with the fact. I now believe it is better to store
> a '0' value when there is no category, so that I can avoid queries like
> "SELECT * FROM facts WHERE id_category IS NULL", which do not use the index
> I had previously created on that field.

(Sorry for being a pain in the neck, but BTW - that is not a star query; it
should be

SELECT
    facts.*
from
    facts,
    id_dim
where
    facts.id_key = id_dim.id_key
  and
    id_dim.id_category IS NULL

[and it really gets to the whole problem of indexing low cardinality
fields])


>
> >On the hardware side - RAID5 might work here because of the low volume if
> >you can pay the write performance penalty. To size hardware you need to
> >estimate load in terms of transaction type (I usually make bucket categories
> >of small, medium, and large effort needs) and transaction rate. Then try to
> >estimate how much CPU and I/O they'll use.
>
> Thank you so much again Aaron. Your contribution has been really important
> to me.
>
> Ciao,
> -Gabriele
>
> >"Let us not speak of them; but look, and pass on."
>
> P.S.: Dante rules ... :-)

:-)

that quote was not a reference to anyone in this group!

Good luck,
/Aaron


Re: Data warehousing requirements

From: Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> For one thing, this is false optimization; a NULL isn't saving you any table
> size on an INT or BIGINT column.    NULLs are only smaller on variable-width
> columns.

Uh ... not true.  The column will not be stored, either way.  Now if you
had a row that otherwise had no nulls, the first null in the column will
cause a null-columns-bitmap to be added, which might more than eat up
the savings from storing a single int or bigint.  But after the first
null, each additional null in a row is a win, free-and-clear, whether
it's fixed-width or not.

(There are also some alignment considerations that might cause the
savings to vanish.)

> More importantly, you should never, ever allow null FKs on a star-topology
> database.  LEFT OUTER JOINs are vastly less efficient than INNER JOINs in a
> query, and the difference between having 20 outer joins for your data view,
> vs 20 regular joins, can easily be a difference of 100x in execution time.

It's not so much that they are necessarily inefficient as that they
constrain the planner's freedom of action.  You need to think a lot more
carefully about the order of joining than when you use inner joins.

            regards, tom lane

Re: Data warehousing requirements

From: Josh Berkus
Date:
Tom,

Well, I sit corrected.   Obviously I misread that.

> It's not so much that they are necessarily inefficient as that they
> constrain the planner's freedom of action.  You need to think a lot more
> carefully about the order of joining than when you use inner joins.

I've also found that OUTER JOINS constrain the types of joins that can/will be
used as well as the order.  Maybe you didn't intend it that way, but (for
example) OUTER JOINs seem much more likely to use expensive merge joins.

--
Josh Berkus
Aglio Database Solutions
San Francisco