Re: Warehouse Schema - Mailing list pgsql-sql

From Richard Huxton
Subject Re: Warehouse Schema
Date
Msg-id 44757EE4.8040807@archonet.com
Whole thread Raw
In response to Warehouse Schema  ("Worky Workerson" <worky.workerson@gmail.com>)
List pgsql-sql
Worky Workerson wrote:
> I'm developing a schema for a large data warehouse (10 billion records) and
> had a couple of questions about how to optimize it.  My biggest question is
> on the design of the fact table and how much normalization I should 
> actually
> do of the data.

As much as you can and still afford the hardware to support it.

> The data is going to be keyed by IP address, stored as a non-unique IP4
> type.  Each record is also tagged with various attributes, such as a
> category and a type.  Assuming that a category and type are VARCHAR, would
> it make sense to normalize these out of the fact table into their 
> respective
> tables and key them by an INTEGER? I.e.
> 
> CREATE TABLE big_fact_table_A (
>    identifier IP4,
>    data1   BYTEA,
>    data2   BYTEA,
>    ...
>    dataN  BYTEA,
>    category    VARCHAR(16),
>    type        VARCHAR(16)
> );
> 
> ... vs ...
> 
> CREATE TABLE big_fact_table_B (
>    identifier IP4,
>    data1   BYTEA,
>    data2   BYTEA,
>    ...
>    dataN  BYTEA,
>    category    INTEGER REFERENCES categories (category_id),
>    type        INTEGER REFERENCES types (type_id)
> );

Assuming categories and types are fixed sets this isn't really 
normalisation, you're just substituting one representation with another. 
That's assuming you enforce valid representations on table_A.

Of course, if the categories or types can change then you'll want to use  foreign-keys. Try int2 rather than int4 even
(althoughcheck the 
 
manuals for details to see if it'll pack them closer).

> I figure that the normalized fact table should be quicker, as the 
> integer is
> much smaller than the varchar.  On query, however, the table will need 
> to be
> joined against the two other tables (categories, types), but I still figure
> that this is a win because the other tables are fairly small and should 
> stay
> resident in memory.  Is this reasoning valid?

That's the trade-off.

> The downside to this (from my perspective) is that the data comes in the
> form of big_fact_table_A and could be easily COPYed straight into the
> table.  with big_fact_table_B it looks like I will have to do the "unJOIN"
> in a script.  Also, I have separate installations of the warehouse (with
> different data sources) and it will be difficult to share data between them
> unless their categories/types tables are keyed with exactly the same 
> integer
> IDs which, as I don't directly control the other installations, is not
> guaranteed.
> 
> Any suggestions to the above "problems"?

If you're comfortable with a little "C" coding, write your own type. 
That's really what you're trying to do here. Accept text input and 
display text output but store in a compact form.

HTH
--   Richard Huxton  Archonet Ltd


pgsql-sql by date:

Previous
From: "Worky Workerson"
Date:
Subject: Warehouse Schema
Next
From: "andi"
Date:
Subject: Re: PLEASE help ME , HOW TO GENERATE PRIMARY Keys on the fly