Re: WIP: multivariate statistics / proof of concept - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: WIP: multivariate statistics / proof of concept
Date
Msg-id 747665527.663493.1416073781449.JavaMail.yahoo@jws100192.mail.ne1.yahoo.com
Whole thread Raw
In response to Re: WIP: multivariate statistics / proof of concept  ("Tomas Vondra" <tv@fuzzy.cz>)
Responses Re: WIP: multivariate statistics / proof of concept  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-hackers
Tomas Vondra <tv@fuzzy.cz> wrote:
> Dne 13 Listopad 2014, 16:51, Katharina Büchse napsal(a):
>> On 13.11.2014 14:11, Tomas Vondra wrote:
>>
>>> The only place where I think this might work are the associative rules.
>>> It's simple to specify rules like ("ZIP code" implies "city") and we could
>>> even do some simple check against the data to see if it actually makes
>>> sense (and 'disable' the rule if not).
>>
>> and even this simple example has its limits, at least in Germany ZIP
>> codes are not unique for rural areas, where several villages have the
>> same ZIP code.

> as you point out most real-world data either contain bugs
> or are inherently imperfect (we have the same kind of ZIP/city
> inconsistencies in Czech).

You can have lots of fun with U.S. zip code, too. Just on the
nominally "Madison, Wisconsin" zip codes (those starting with 537),
there are several exceptions:

select zipcode, city, locationtype
from zipcode
where zipcode like '537%'
and Decommisioned = 'false'
and zipcodetype = 'STANDARD'
and locationtype in ('PRIMARY', 'ACCEPTABLE')
order by zipcode, city;

zipcode | city | locationtype
---------+-----------+--------------
53703 | MADISON | PRIMARY
53704 | MADISON | PRIMARY
53705 | MADISON | PRIMARY
53706 | MADISON | PRIMARY
53711 | FITCHBURG | ACCEPTABLE
53711 | MADISON | PRIMARY
53713 | FITCHBURG | ACCEPTABLE
53713 | MADISON | PRIMARY
53713 | MONONA | ACCEPTABLE
53714 | MADISON | PRIMARY
53714 | MONONA | ACCEPTABLE
53715 | MADISON | PRIMARY
53716 | MADISON | PRIMARY
53716 | MONONA | ACCEPTABLE
53717 | MADISON | PRIMARY
53718 | MADISON | PRIMARY
53719 | FITCHBURG | ACCEPTABLE
53719 | MADISON | PRIMARY
53725 | MADISON | PRIMARY
53726 | MADISON | PRIMARY
53744 | MADISON | PRIMARY
(21 rows)

If you eliminate the quals besides the zipcode column you get 61
rows and it gets much stranger, with legal municipalities that are
completely surrounded by Madison that the postal service would
rather you didn't use in addressing your envelopes, but they have
to deliver to anyway, and organizations inside Madison receiving
enough mail to (literally) have their own zip code -- where the
postal service allows the organization name as a deliverable
"city".

If you want to have your own fun with this data, you can download
it here:

http://federalgovernmentzipcodes.us/free-zipcode-database.csv

I was able to load it into PostgreSQL with this:

create table zipcode
(
recordnumber integer not null,
zipcode text not null,
zipcodetype text not null,
city text not null,
state text not null,
locationtype text not null,
lat double precision,
long double precision,
xaxis double precision not null,
yaxis double precision not null,
zaxis double precision not null,
worldregion text not null,
country text not null,
locationtext text,
location text,
decommisioned text not null,
taxreturnsfiled bigint,
estimatedpopulation bigint,
totalwages bigint,
notes text
);
comment on column zipcode.zipcode is 'Zipcode or military postal code(FPO/APO)';
comment on column zipcode.zipcodetype is 'Standard, PO BOX Only, Unique, Military(implies APO or FPO)';
comment on column zipcode.city is 'offical city name(s)';
comment on column zipcode.state is 'offical state, territory, or quasi-state (AA, AE, AP) abbreviation code';
comment on column zipcode.locationtype is 'Primary, Acceptable,Not Acceptable';
comment on column zipcode.lat is 'Decimal Latitude, if available';
comment on column zipcode.long is 'Decimal Longitude, if available';
comment on column zipcode.location is 'Standard Display (eg Phoenix, AZ ; Pago Pago, AS ; Melbourne, AU )';
comment on column zipcode.decommisioned is 'If Primary location, Yes implies historical Zipcode, No Implies current
Zipcode;If not Primary, Yes implies Historical Placename'; 
comment on column zipcode.taxreturnsfiled is 'Number of Individual Tax Returns Filed in 2008';
copy zipcode from 'filepath' with (format csv, header);
alter table zipcode add primary key (recordnumber);
create unique index zipcode_city on zipcode (zipcode, city);

I bet there are all sorts of correlation possibilities with, for
example, latitude and longitude and other variables.  With 81831
rows and so many correlations among the columns, it might be a
useful data set to test with.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: "Maeldron T."
Date:
Subject: Re: Failback to old master
Next
From: Tomas Vondra
Date:
Subject: Re: WIP: multivariate statistics / proof of concept