Worky Workerson wrote:
> I'm developing a schema for a large data warehouse (10 billion records) and
> had a couple of questions about how to optimize it. My biggest question is
> on the design of the fact table and how much normalization I should
> actually
> do of the data.
As much as you can and still afford the hardware to support it.
> The data is going to be keyed by IP address, stored as a non-unique IP4
> type. Each record is also tagged with various attributes, such as a
> category and a type. Assuming that a category and type are VARCHAR, would
> it make sense to normalize these out of the fact table into their
> respective
> tables and key them by an INTEGER? I.e.
>
> CREATE TABLE big_fact_table_A (
> identifier IP4,
> data1 BYTEA,
> data2 BYTEA,
> ...
> dataN BYTEA,
> category VARCHAR(16),
> type VARCHAR(16)
> );
>
> ... vs ...
>
> CREATE TABLE big_fact_table_B (
> identifier IP4,
> data1 BYTEA,
> data2 BYTEA,
> ...
> dataN BYTEA,
> category INTEGER REFERENCES categories (category_id),
> type INTEGER REFERENCES types (type_id)
> );
Assuming categories and types are fixed sets this isn't really
normalisation, you're just substituting one representation with another.
That's assuming you enforce valid representations on table_A.
Of course, if the categories or types can change then you'll want to use foreign-keys. Try int2 rather than int4 even
(althoughcheck the
manuals for details to see if it'll pack them closer).
> I figure that the normalized fact table should be quicker, as the
> integer is
> much smaller than the varchar. On query, however, the table will need
> to be
> joined against the two other tables (categories, types), but I still figure
> that this is a win because the other tables are fairly small and should
> stay
> resident in memory. Is this reasoning valid?
That's the trade-off.
> The downside to this (from my perspective) is that the data comes in the
> form of big_fact_table_A and could be easily COPYed straight into the
> table. with big_fact_table_B it looks like I will have to do the "unJOIN"
> in a script. Also, I have separate installations of the warehouse (with
> different data sources) and it will be difficult to share data between them
> unless their categories/types tables are keyed with exactly the same
> integer
> IDs which, as I don't directly control the other installations, is not
> guaranteed.
>
> Any suggestions to the above "problems"?
If you're comfortable with a little "C" coding, write your own type.
That's really what you're trying to do here. Accept text input and
display text output but store in a compact form.
HTH
-- Richard Huxton Archonet Ltd