Re: extracting location info from string - Mailing list pgsql-sql

From Tarlika Elisabeth Schmitz
Subject Re: extracting location info from string
Date
Msg-id 20110523233949.07fb0005@dick.coachhouse
In response to Re: extracting location info from string  (Andrej <andrej.groups@gmail.com>)
Responses Re: extracting location info from string  (Craig Ringer <craig@postnewspapers.com.au>)
List pgsql-sql
On Mon, 23 May 2011 13:11:24 +1200
Andrej <andrej.groups@gmail.com> wrote:

>On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
><postgresql3@numerixtechnology.de> wrote:
>> On Sun, 22 May 2011 21:05:26 +0100
>> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>>
>>>A column contains location information, which may contain any of the
>>>following:
>>>
>>>1) null
>>>2) country name (e.g. "France")
>>>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>>
>>
>> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.
>
>sanitising that data will be tedious - particularly with the
>variations on region.

Indeed. However, the situation is not quite as bleak as it appears:
- I am only dealing with 50 countries (Luxemburg and Vatican are not
amongst them).
- Only for two countries will city/region be displayed instead of
country.
- Ultimately, the only important bit of information is the country.
- The only really important persons are those from the two countries.


>Another thing of great import is whether the city can occur in the
>data column all by itself; if yes, it's next to impossible to
>distinguish it from a country.

Unfortunately this is the case.

>Specially if we assume that
>typos/misspelling are feasible on top of punctuation ... 

and former countries like Czechoslovakia ...

>If I had a task like that to perform I'd dump the data out to file
>and have a good go at it w/ sed & awk, or perl, depending on
>how complex & voluminous the data is.

I had a quick look at the data (maybe not the most efficient SQL):

SELECT id, name,
    CASE
        WHEN location IS NULL THEN null
        WHEN location !~ '.*,.*' THEN
            (SELECT id FROM country WHERE name = location)
        ELSE
            (SELECT country FROM county WHERE name =
                regexp_replace(location, '.*, (Rg\\.? )?(.*)', '\\2'))
    END AS country,
    location
FROM temp_person
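
For what it's worth, the region-extraction regex can be checked outside
the database. A minimal Python sketch (sample locations taken from my
earlier message; the pattern mirrors the regexp_replace() above):

```python
import re

# Same pattern as in the regexp_replace() call: skip everything up to
# the last ", ", drop an optional "Rg."/"Rg " prefix, keep the region.
pattern = re.compile(r'.*, (Rg\.? )?(.*)')

samples = [
    "Bonn, Nordrhein-Westfalen",
    "Frankfurt, Rg. Hessen",
    "Frankfurt, Rg Hessen",
]

for location in samples:
    m = pattern.match(location)
    region = m.group(2) if m else location
    print(region)  # Nordrhein-Westfalen / Hessen / Hessen
```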


Of 17000 historical records, 4400 don't match this simple pattern.
Of those 4400, 1300 are "USA" or "Usa" instead of "United States", and
900 are "North America" - whatever that is! There are also plenty of
common and valid region abbreviations.

I get about 1000 new records of this type per year.

I presume that more recent data are more accurate. I know I won't be
able to clean them all up. 

However, the import process has to be as automatic as possible, with
inconsistencies flagged up for later manual intervention. I say later
because, for instance, a person's data will have to be imported with or
without location info, as other new data will link to it.
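
One possible shape for that flag-and-continue workflow, sketched in
Python (the function names and lookup dicts are hypothetical, not my
actual import code): resolve the country where possible, queue
unresolvable locations for review, and import the person either way.

```python
import re

def resolve_country(location, countries, regions):
    """Return a country id or None.

    countries maps country name -> country id;
    regions maps region name -> the id of the country it belongs to.
    (Both lookup tables are hypothetical stand-ins for the real ones.)
    """
    if location is None:
        return None
    if "," not in location:
        return countries.get(location)  # unknown name -> None -> flagged
    # Take the part after the comma, dropping an optional "Rg."/"Rg" prefix.
    region = re.sub(r'.*, (Rg\.? )?(.*)', r'\2', location)
    return regions.get(region)

def import_person(row, countries, regions, review_queue):
    """Import a person row; flag it for review if the location is unresolvable."""
    country = resolve_country(row["location"], countries, regions)
    if country is None and row["location"] is not None:
        review_queue.append(row)        # later manual intervention
    return {**row, "country": country}  # person is imported either way
```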

-- 

Best Regards,
Tarlika Elisabeth Schmitz


pgsql-sql by date:

Previous
From: Dean le Roux
Date:
Subject: Re: problem with update data sets from front ends 8.4, Ubuntu 10.04
Next
From: Craig Ringer
Date:
Subject: Re: Which version of PostgreSQL should I use.