Re: extracting location info from string - Mailing list pgsql-sql
From: Tarlika Elisabeth Schmitz
Subject: Re: extracting location info from string
Msg-id: 20110523233949.07fb0005@dick.coachhouse
In response to: Re: extracting location info from string (Andrej <andrej.groups@gmail.com>)
Responses: Re: extracting location info from string
List: pgsql-sql
On Mon, 23 May 2011 13:11:24 +1200 Andrej <andrej.groups@gmail.com> wrote:

>On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
><postgresql3@numerixtechnology.de> wrote:
>> On Sun, 22 May 2011 21:05:26 +0100
>> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>>
>>>A column contains location information, which may contain any of the
>>>following:
>>>
>>>1) null
>>>2) country name (e.g. "France")
>>>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>>
>> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.
>
>sanitising that data will be tedious - particularly with the
>variations on region.

Indeed. However, the situation is not quite as bleak as it appears:

- I am only dealing with 50 countries (Luxemburg and Vatican are not
  amongst them).
- Only for two countries will city/region be displayed instead of
  country.
- Ultimately, the only important bit of information is the country.
- The only really important persons are those from the two countries.

>Another thing of great import is whether the city can occur in the
>data column all by itself; if yes, it's next to impossible to
>distinguish it from a country.

Unfortunately, this is the case.

>Specially if we assume that
>typos/misspelling are feasible on top of punctuation ...

... and former countries like Czechoslovakia ...

>If I had a task like that to perform I'd dump the data out to file
>and have a good go at it w/ sed & awk, or perl, depending on
>how complex & voluminous the data is.

I had a quick look at the data (maybe not the most efficient SQL):

SELECT id, name,
       CASE
         WHEN location IS NULL THEN null
         WHEN location !~ '.*,.*' THEN
           (SELECT id FROM country WHERE name = location)
         ELSE
           (SELECT country FROM region WHERE name =
              regexp_replace(location, '.*, (Rg\\.? )?(.*)', '\\2'))
       END AS country,
       location
FROM temp_person;

Of 17000 historical records, 4400 don't match this simple pattern. Of
the 4400, 1300 are "USA" or "Usa" instead of "United States", and 900
are "North America", whatever that is! There are plenty of common and
valid region abbreviations.

I get about 1000 new records of this type per year. I presume that more
recent data are more accurate. I know I won't be able to clean them all
up. However, the import process has to be as automatic as possible, in
such a way that inconsistencies are flagged up for later manual
intervention. I say "later" because, for instance, a person's data will
have to be imported with or without location info, because other new
data will link to it.

--
Best Regards,
Tarlika Elisabeth Schmitz
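P.S. For experimenting with the pattern outside the database, here is a
rough Python sketch of the same parsing logic. The function name, the
alias table, and its values are mine and purely illustrative; it only
mirrors the regexp_replace idea above, not the actual import job:

```python
import re

# Same idea as the SQL regexp_replace: after the comma, an optional
# "Rg."/"Rg" marker may precede the region name.
LOCATION_RE = re.compile(r'^(?P<city>[^,]+),\s*(?:Rg\.?\s+)?(?P<region>.+)$')

# Illustrative aliases the plain pattern cannot resolve on its own.
COUNTRY_ALIASES = {
    'USA': 'United States',
    'Usa': 'United States',
}

def parse_location(location):
    """Classify a raw location string.

    Returns None for null input, ('region', (city, region)) for
    "city, region" strings, or ('country', name) otherwise, with
    known aliases normalised.
    """
    if location is None:
        return None
    m = LOCATION_RE.match(location)
    if m:
        return ('region', (m.group('city').strip(), m.group('region')))
    return ('country', COUNTRY_ALIASES.get(location, location))

# parse_location("France")                 -> ('country', 'France')
# parse_location("Frankfurt, Rg. Hessen")  -> ('region', ('Frankfurt', 'Hessen'))
# parse_location("USA")                    -> ('country', 'United States')
```

Like the SQL version, this would mis-split a region name that itself
begins with "Rg", and it does nothing about misspellings; those cases
still need the manual-intervention flagging described above.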