Thread: Warehouse Schema

Warehouse Schema

From

"Worky Workerson"

Date:

24 May 2006, 18:50:31

I'm developing a schema for a large data warehouse (10 billion records) and had a couple of questions about how to
optimizeit.  My biggest question is on the design of the fact table and how much normalization I should actually do of
thedata.  <br /><br />The data is going to be keyed by IP address, stored as a non-unique IP4 type.  Each record is
alsotagged with various attributes, such as a category and a type.  Assuming that a category and type are VARCHAR,
wouldit make sense to normalize these out of the fact table into their respective tables and key them by an INTEGER?
I.e.<br/><br />CREATE TABLE big_fact_table_A (<br />     identifier IP4,<br />     data1   BYTEA,<br />     data2  
BYTEA,<br/>     ...<br />     dataN  BYTEA,<br />     category    VARCHAR(16),<br />     type        VARCHAR(16)<br />
);<br/><br />... vs ...<br /><br />CREATE TABLE big_fact_table_B (<br />    identifier IP4,<br />    data1   BYTEA,<br
/>   data2   BYTEA,<br />    ...<br />    dataN  BYTEA,<br />    category    INTEGER REFERENCES categories
(category_id),<br />    type        INTEGER REFERENCES types (type_id)<br />);<br /><br />I figure that the normalized
facttable should be quicker, as the integer is much smaller than the varchar.  On query, however, the table will need
tobe joined against the two other tables (categories, types), but I still figure that this is a win because the other
tablesare fairly small and should stay resident in memory.  Is this reasoning valid? <br /><br />The downside to this
(frommy perspective) is that the data comes in the form of big_fact_table_A and could be easily COPYed straight into
thetable.  with big_fact_table_B it looks like I will have to do the "unJOIN" in a script.  Also, I have separate
installationsof the warehouse (with different data sources) and it will be difficult to share data between them unless
theircategories/types tables are keyed with exactly the same integer IDs which, as I don't directly control the other
installations,is not guaranteed. <br /><br />Any suggestions to the above "problems"?<br />

Re: Warehouse Schema

From

Richard Huxton

Date:

25 May 2006, 09:55:05

Worky Workerson wrote:
> I'm developing a schema for a large data warehouse (10 billion records) and
> had a couple of questions about how to optimize it.  My biggest question is
> on the design of the fact table and how much normalization I should 
> actually
> do of the data.

As much as you can and still afford the hardware to support it.

> The data is going to be keyed by IP address, stored as a non-unique IP4
> type.  Each record is also tagged with various attributes, such as a
> category and a type.  Assuming that a category and type are VARCHAR, would
> it make sense to normalize these out of the fact table into their 
> respective
> tables and key them by an INTEGER? I.e.
> 
> CREATE TABLE big_fact_table_A (
>    identifier IP4,
>    data1   BYTEA,
>    data2   BYTEA,
>    ...
>    dataN  BYTEA,
>    category    VARCHAR(16),
>    type        VARCHAR(16)
> );
> 
> ... vs ...
> 
> CREATE TABLE big_fact_table_B (
>    identifier IP4,
>    data1   BYTEA,
>    data2   BYTEA,
>    ...
>    dataN  BYTEA,
>    category    INTEGER REFERENCES categories (category_id),
>    type        INTEGER REFERENCES types (type_id)
> );

Assuming categories and types are fixed sets this isn't really 
normalisation, you're just substituting one representation with another. 
That's assuming you enforce valid representations on table_A.

Of course, if the categories or types can change then you'll want to use  foreign-keys. Try int2 rather than int4 even
(althoughcheck the 
 
manuals for details to see if it'll pack them closer).

> I figure that the normalized fact table should be quicker, as the 
> integer is
> much smaller than the varchar.  On query, however, the table will need 
> to be
> joined against the two other tables (categories, types), but I still figure
> that this is a win because the other tables are fairly small and should 
> stay
> resident in memory.  Is this reasoning valid?

That's the trade-off.

> The downside to this (from my perspective) is that the data comes in the
> form of big_fact_table_A and could be easily COPYed straight into the
> table.  with big_fact_table_B it looks like I will have to do the "unJOIN"
> in a script.  Also, I have separate installations of the warehouse (with
> different data sources) and it will be difficult to share data between them
> unless their categories/types tables are keyed with exactly the same 
> integer
> IDs which, as I don't directly control the other installations, is not
> guaranteed.
> 
> Any suggestions to the above "problems"?

If you're comfortable with a little "C" coding, write your own type. 
That's really what you're trying to do here. Accept text input and 
display text output but store in a compact form.

HTH
--   Richard Huxton  Archonet Ltd