Re: Warehouse Schema - Mailing list pgsql-sql
From | Richard Huxton |
---|---|
Subject | Re: Warehouse Schema |
Date | |
Msg-id | 44757EE4.8040807@archonet.com Whole thread Raw |
In response to | Warehouse Schema ("Worky Workerson" <worky.workerson@gmail.com>) |
List | pgsql-sql |
Worky Workerson wrote: > I'm developing a schema for a large data warehouse (10 billion records) and > had a couple of questions about how to optimize it. My biggest question is > on the design of the fact table and how much normalization I should > actually > do of the data. As much as you can and still afford the hardware to support it. > The data is going to be keyed by IP address, stored as a non-unique IP4 > type. Each record is also tagged with various attributes, such as a > category and a type. Assuming that a category and type are VARCHAR, would > it make sense to normalize these out of the fact table into their > respective > tables and key them by an INTEGER? I.e. > > CREATE TABLE big_fact_table_A ( > identifier IP4, > data1 BYTEA, > data2 BYTEA, > ... > dataN BYTEA, > category VARCHAR(16), > type VARCHAR(16) > ); > > ... vs ... > > CREATE TABLE big_fact_table_B ( > identifier IP4, > data1 BYTEA, > data2 BYTEA, > ... > dataN BYTEA, > category INTEGER REFERENCES categories (category_id), > type INTEGER REFERENCES types (type_id) > ); Assuming categories and types are fixed sets this isn't really normalisation, you're just substituting one representation with another. That's assuming you enforce valid representations on table_A. Of course, if the categories or types can change then you'll want to use foreign-keys. Try int2 rather than int4 even (althoughcheck the manuals for details to see if it'll pack them closer). > I figure that the normalized fact table should be quicker, as the > integer is > much smaller than the varchar. On query, however, the table will need > to be > joined against the two other tables (categories, types), but I still figure > that this is a win because the other tables are fairly small and should > stay > resident in memory. Is this reasoning valid? That's the trade-off. > The downside to this (from my perspective) is that the data comes in the > form of big_fact_table_A and could be easily COPYed straight into the > table. with big_fact_table_B it looks like I will have to do the "unJOIN" > in a script. Also, I have separate installations of the warehouse (with > different data sources) and it will be difficult to share data between them > unless their categories/types tables are keyed with exactly the same > integer > IDs which, as I don't directly control the other installations, is not > guaranteed. > > Any suggestions to the above "problems"? If you're comfortable with a little "C" coding, write your own type. That's really what you're trying to do here. Accept text input and display text output but store in a compact form. HTH -- Richard Huxton Archonet Ltd