Re: Normalization tools for postgres? - Mailing list pgsql-general
From | Richard Huxton |
---|---|
Subject | Re: Normalization tools for postgres? |
Date | |
Msg-id | 474491D1.9090403@archonet.com Whole thread Raw |
In response to | Re: VB ADODB .Open failing (Richard Broersma Jr <rabroersma@yahoo.com>) |
List | pgsql-general |
Dane Springmeyer wrote: > On Nov 21, 2007, at 1:37 AM, Richard Huxton wrote: >> Dane Springmeyer wrote: >>> These mostly reside in MS access in flat tables and and I am >>> importing them into postgres. >> >> If you are familiar/comfortable with Access and VB, I'd probably do >> the work there. > > Unfortunately I am not. I am only familiar with very basic SQL in Access > and not with VB. And I am not interested in investing any time learning > microsoft products. I'd rather put time in learning postgres and php or > other languages to manipulate data in postgres. Fair enough. It's easy enough to get started with PHP. Oh, you don't need to run it in a webserver, you can do so from the command-line too. Any of the scripting languages will do nicely for this sort of thing - Perl, Python, Ruby etc. >> 1. Import as-is into a table called e.g. raw_data >> >> 2. Identify/add the primary-key (presumably ID in this case) in raw_data >> If no ID, you can add a column of type SERIAL to raw_data and let it >> be populated automatically. >> >> 3. CREATE TABLE lookups.region (id SERIAL, description text NOT NULL >> UNIQUE, PRIMARY KEY id); >> INSERT INTO lookups.region (description) SELECT DISTINCT region from >> raw_data; >> >> 4. Repeat #3 for other lookups >> >> 5. CREATE TABLE processed_data (...); >> INSERT INTO processed_data (id, region_id, ...) >> SELECT raw.id, lkp_reg.id, ... >> FROM raw_data raw >> JOIN lookups.region lkp_reg ON raw.region = lkp_reg.description >> JOIN lookups.whatever... > Wow. That was EXTREMELY helpful. With those concepts I've not been able > to do EXACTLY what I was shooting for AND now understand the SQL well > enough to start thinking of better ways to do it as well. Thank you. Good. > Here is the SQL which inserts the sample data and processes it into 4 > different tables. Perhaps I after you take a look I could post this back > to the group? I've cc:ed the list on this one for you - plenty of smart people on that list. > CREATE TABLE raw_data ( Later on, you might want to consider CREATE TEMPORARY TABLE... but don't worry for the moment. > wid int4, > region character(35), > drain character(65), > eco character(29) These should almost certainly be "varchar" (or "character varying") - unless you actually want each field padded with spaces. In fact, since this is just a temporary thing I'd make them all type "text". > ); > > INSERT INTO raw_data VALUES (11210, 'SW Oregon / N Cali Coast', 'Rogue > River', 'California Coast'); > INSERT INTO raw_data VALUES (11100, 'SW Oregon / N Cali Coast', 'Coastal > grouping of Chetco River and Pistol River', 'California Coast'); You might want to read the manuals regarding the COPY <table> FROM STDIN command. Good for bulk-loading data. Also, you could download the pg-odbc driver and link to PG from Access to copy the data over. > > drop table region; There's an "IF EXISTS" clause you can add to DROP TABLE - prevents errors. > CREATE TABLE region (id SERIAL, name text NOT NULL UNIQUE, PRIMARY KEY > (id)); > INSERT INTO region (name) SELECT DISTINCT region from raw_data; > Select * from region; [snip repeats for other tables] > drop table processed_data; > CREATE TABLE processed_data (wid int4 NOT NULL UNIQUE, region_id int4 > NOT NULL, eco_id int4 NOT NULL, drain_id int4 NOT NULL, PRIMARY KEY (wid)); > INSERT INTO processed_data (wid, region_id, eco_id, drain_id) > SELECT w.wid, r.id, e.id, d.id > FROM raw_data as w, region as r, eco as e, drain as d > WHERE w.region = r.name AND w.drain = d.name AND w.eco = e.name; Good. You'll want to read up on foreign-keys too. You can define them when you build the table, or add them after. Something like (not checked): ALTER TABLE processed_data ADD CONSTRAINT valid_region FOREIGN KEY (region_id) REFERENCES region (id); >> 6. Wrap the entire lot above in BEGIN...END so it either all works or >> it doesn't then VACUUM FULL the database at key points. > > I'm not familiar with these ideas, but I'll look into them in the docs... Read up on VACUUM, ANALYSE and the autovacuum daemon (agent). >> If you're comfortable with a bit of VB/Perl/Python/PHP/plpgsql then >> you can automate that fairly simply. If not, a bit of cut+paste will >> see you there. > > Yes, I think I'll experiment with trying to produce this SQL text with > php. Seems like in a very short time I could have a custom script to > parse any table given column names and types which I'd like to break > out. Very nice. The fun comes when you have to clean up the data - correct spelling mistakes, remove duplicates etc. -- Richard Huxton Archonet Ltd
pgsql-general by date: