Re: WIP: multivariate statistics / proof of concept - Mailing list pgsql-hackers
From | Kevin Grittner |
---|---|
Subject | Re: WIP: multivariate statistics / proof of concept |
Date | |
Msg-id | 747665527.663493.1416073781449.JavaMail.yahoo@jws100192.mail.ne1.yahoo.com Whole thread Raw |
In response to | Re: WIP: multivariate statistics / proof of concept ("Tomas Vondra" <tv@fuzzy.cz>) |
Responses |
Re: WIP: multivariate statistics / proof of concept
(Tomas Vondra <tv@fuzzy.cz>)
|
List | pgsql-hackers |
Tomas Vondra <tv@fuzzy.cz> wrote: > Dne 13 Listopad 2014, 16:51, Katharina Büchse napsal(a): >> On 13.11.2014 14:11, Tomas Vondra wrote: >> >>> The only place where I think this might work are the associative rules. >>> It's simple to specify rules like ("ZIP code" implies "city") and we could >>> even do some simple check against the data to see if it actually makes >>> sense (and 'disable' the rule if not). >> >> and even this simple example has its limits, at least in Germany ZIP >> codes are not unique for rural areas, where several villages have the >> same ZIP code. > as you point out most real-world data either contain bugs > or are inherently imperfect (we have the same kind of ZIP/city > inconsistencies in Czech). You can have lots of fun with U.S. zip code, too. Just on the nominally "Madison, Wisconsin" zip codes (those starting with 537), there are several exceptions: select zipcode, city, locationtype from zipcode where zipcode like '537%' and Decommisioned = 'false' and zipcodetype = 'STANDARD' and locationtype in ('PRIMARY', 'ACCEPTABLE') order by zipcode, city; zipcode | city | locationtype ---------+-----------+-------------- 53703 | MADISON | PRIMARY 53704 | MADISON | PRIMARY 53705 | MADISON | PRIMARY 53706 | MADISON | PRIMARY 53711 | FITCHBURG | ACCEPTABLE 53711 | MADISON | PRIMARY 53713 | FITCHBURG | ACCEPTABLE 53713 | MADISON | PRIMARY 53713 | MONONA | ACCEPTABLE 53714 | MADISON | PRIMARY 53714 | MONONA | ACCEPTABLE 53715 | MADISON | PRIMARY 53716 | MADISON | PRIMARY 53716 | MONONA | ACCEPTABLE 53717 | MADISON | PRIMARY 53718 | MADISON | PRIMARY 53719 | FITCHBURG | ACCEPTABLE 53719 | MADISON | PRIMARY 53725 | MADISON | PRIMARY 53726 | MADISON | PRIMARY 53744 | MADISON | PRIMARY (21 rows) If you eliminate the quals besides the zipcode column you get 61 rows and it gets much stranger, with legal municipalities that are completely surrounded by Madison that the postal service would rather you didn't use in addressing your envelopes, but they have to deliver to anyway, and organizations inside Madison receiving enough mail to (literally) have their own zip code -- where the postal service allows the organization name as a deliverable "city". If you want to have your own fun with this data, you can download it here: http://federalgovernmentzipcodes.us/free-zipcode-database.csv I was able to load it into PostgreSQL with this: create table zipcode ( recordnumber integer not null, zipcode text not null, zipcodetype text not null, city text not null, state text not null, locationtype text not null, lat double precision, long double precision, xaxis double precision not null, yaxis double precision not null, zaxis double precision not null, worldregion text not null, country text not null, locationtext text, location text, decommisioned text not null, taxreturnsfiled bigint, estimatedpopulation bigint, totalwages bigint, notes text ); comment on column zipcode.zipcode is 'Zipcode or military postal code(FPO/APO)'; comment on column zipcode.zipcodetype is 'Standard, PO BOX Only, Unique, Military(implies APO or FPO)'; comment on column zipcode.city is 'offical city name(s)'; comment on column zipcode.state is 'offical state, territory, or quasi-state (AA, AE, AP) abbreviation code'; comment on column zipcode.locationtype is 'Primary, Acceptable,Not Acceptable'; comment on column zipcode.lat is 'Decimal Latitude, if available'; comment on column zipcode.long is 'Decimal Longitude, if available'; comment on column zipcode.location is 'Standard Display (eg Phoenix, AZ ; Pago Pago, AS ; Melbourne, AU )'; comment on column zipcode.decommisioned is 'If Primary location, Yes implies historical Zipcode, No Implies current Zipcode;If not Primary, Yes implies Historical Placename'; comment on column zipcode.taxreturnsfiled is 'Number of Individual Tax Returns Filed in 2008'; copy zipcode from 'filepath' with (format csv, header); alter table zipcode add primary key (recordnumber); create unique index zipcode_city on zipcode (zipcode, city); I bet there are all sorts of correlation possibilities with, for example, latitude and longitude and other variables. With 81831 rows and so many correlations among the columns, it might be a useful data set to test with. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: