Thread: extracting location info from string
A column contains location information, which may contain any of the following:

1) null
2) country name (e.g. "France")
3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")

Using the tables below, I would like to derive COUNTRY.ID, REGION.CODE, city name.

=====
CREATE TABLE person
(
  id integer NOT NULL,
  "name" character varying(256) NOT NULL,
  "location" character varying(256),
  CONSTRAINT person_pkey PRIMARY KEY (id)
);

CREATE TABLE country
(
  id character varying(3) NOT NULL, -- alpha-3 code
  "name" character varying(50) NOT NULL,
  CONSTRAINT country_pkey PRIMARY KEY (id)
);

CREATE TABLE region
(
  country character varying(3) NOT NULL,
  code character varying(3) NOT NULL,
  "name" character varying(50) NOT NULL,
  CONSTRAINT region_pkey PRIMARY KEY (country, code),
  CONSTRAINT country_region_fk FOREIGN KEY (country)
      REFERENCES country (id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT
);
=====

System: PostgreSQL 8.4

--
Best Regards,
Tarlika Elisabeth Schmitz
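A first cut at splitting the comma-separated variants might look like the following (a sketch only: the handling of the optional "Rg."/"Rg " prefix is inferred from the examples above, and the doubled backslashes assume the PostgreSQL 8.4 default string-literal settings used elsewhere in this thread):

SELECT p.id,
       p.location,
       -- text before the first comma, if any (NULL for the plain-country case)
       trim(substring(p.location from '^([^,]+),')) AS city,
       -- text after the comma with any "Rg."/"Rg " prefix stripped;
       -- falls through unchanged for values without a comma (e.g. "France")
       trim(regexp_replace(p.location, '^[^,]+,\\s*(Rg\\.?\\s+)?', '')) AS region_or_country
FROM   person p
WHERE  p.location IS NOT NULL;

The region_or_country value can then be looked up against REGION.NAME or COUNTRY.NAME respectively.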
On Sun, 22 May 2011 21:05:26 +0100
Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:

>A column contains location information, which may contain any of the
>following:
>
>1) null
>2) country name (e.g. "France")
>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")

I also need to cope with variations of COUNTRY.NAME and REGION.NAME.
On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
<postgresql3@numerixtechnology.de> wrote:
> On Sun, 22 May 2011 21:05:26 +0100
> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>
>>A column contains location information, which may contain any of the
>>following:
>>
>>1) null
>>2) country name (e.g. "France")
>>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>
> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.

I'm not sure I fully understand your request, but sanitising that data will be tedious, whichever way you dice it - particularly with the variations on region.

Another thing of great import is whether the city can occur in the data column all by itself; if yes, it's next to impossible to distinguish it from a country. Especially if we assume that typos/misspellings are possible on top of the punctuation variations ... Of course, you could create a list of valid cities and countries, with homophones thrown in for good measure, and compare & replace things appropriately, with any name lacking a clean match being reported for human interaction =o)

If I had a task like that to perform, I'd dump the data out to a file and have a good go at it with sed & awk, or perl, depending on how complex & voluminous the data is.

Cheers,
Andrej

--
Please don't top post, and don't use HTML e-Mail :}  Make your quotes concise.

http://www.georgedillon.com/web/html_email_is_evil.shtml
On 23/05/2011 9:11 AM, Andrej wrote:
> On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
> <postgresql3@numerixtechnology.de> wrote:
>> On Sun, 22 May 2011 21:05:26 +0100
>> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>>
>>> A column contains location information, which may contain any of the
>>> following:
>>>
>>> 1) null
>>> 2) country name (e.g. "France")
>>> 3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>> 4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>> 5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>>
>> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.

This is a hard problem. You're dealing with free-form data that might be easily understood by humans, but relies on various contextual information and knowledge that makes it really hard for computers to understand.

If you want to do a good job of this, your best bet is to plug in 3rd party address analysis software that is dedicated to this task. Most (all?) such packages are commercial, proprietary affairs. They exist because it's really, really hard to do this right.

> Another thing of great import is whether the city can occur in the
> data column all by itself; if yes, it's next to impossible to distinguish
> it from a country.

Not least because some places are both, eg:

Luxembourg
The Vatican
Singapore

(The Grand Duchy of Luxembourg has other cities, but still serves as an example).

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/
On 05/22/2011 09:42 PM, Craig Ringer wrote:
> On 23/05/2011 9:11 AM, Andrej wrote:
>> On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
>> <postgresql3@numerixtechnology.de> wrote:
>>> On Sun, 22 May 2011 21:05:26 +0100
>>> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>>>
>>>> A column contains location information, which may contain any of the
>>>> following:
>>>>
>>>> 1) null
>>>> 2) country name (e.g. "France")
>>>> 3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>>> 4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>>> 5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>>>
>>> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.

That isn't a table structure, that's a freeform text structure. You didn't state your question, Tarlika, but your database structure is terrible. For example, "region" and "country" should be different columns. Really!

How you get your raw data into those columns can be interesting.

> This is a hard problem. You're dealing with free-form data that might be
> easily understood by humans, but relies on various contextual information and
> knowledge that makes it really hard for computers to understand.
>
> If you want to do a good job of this, your best bet is to plug in 3rd party
> address analysis software that is dedicated to this task. Most (all?) such

These aren't really addresses, as the OP presents them.

> packages are commercial, proprietary affairs. They exist because it's really,
> really hard to do this right.
>
>> Another thing of great import is whether the city can occur in the
>> data column all by itself; if yes, it's next to impossible to distinguish
>> it from a country.
>
> Not least because some places are both, eg:
>
> Luxembourg
> The Vatican
> Singapore
>
> (The Grand Duchy of Luxembourg has other cities, but still serves as an example).

And, of course, you have to distinguish the City of London from London. New York City comprises five boroughs (counties), each of which is itself a city. (Brooklyn is one of the largest cities in the world all by itself.)

"Region" has different meanings in different areas - it can mean part of a county, or state / province, or nation, or continent. "The Baltic region", "the Northeast", "upstate", "the North Country", "Europe" are all regions.

The OP should share more about the semantics of their problem domain and whether they really intend those table structures to be table structures. Really?

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
I would think that changing the location column to hold a FK to a location table, and setting up the location table with various columns for city, region, country and whatever else might be required, would be the way to go. It reduces column bloat on the main table, provides reuse of location data and is easier to modify in the future.

Edward W. Rouse
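A rough sketch of that layout (the table and column names here are illustrative, not taken from the schema posted earlier):

CREATE TABLE location
(
  id      serial PRIMARY KEY,
  town    character varying(50),
  region  character varying(50),
  country character varying(3) REFERENCES country (id)
);

-- person would then carry a reference instead of the raw string:
-- ALTER TABLE person ADD COLUMN location_id integer REFERENCES location (id);

Whether this pays off depends on how much location data is actually shared between persons.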
On Mon, 23 May 2011 00:25:16 -0400
Lew <noone@lewscanon.com> wrote:

>That isn't a table structure, that's a freeform text structure. You
>didn't state your question, Tarlika, but your database structure is
>terrible. For example, "region" and "country" should be different
>columns.

I presume you are referring to my original post:

CREATE TABLE person
(
  id integer NOT NULL,
  "name" character varying(256) NOT NULL,
  "location" character varying(256),
  CONSTRAINT person_pkey PRIMARY KEY (id)
);

Sorry, this was just a TEMPORARY table I created for quick analysis of my CSV data (now renamed to temp_person).

The target table is:

CREATE TABLE person
(
  id integer NOT NULL,
  "name" character varying(100) NOT NULL,
  country character varying(3),
  county character varying(3),
  town character varying(50),
  CONSTRAINT trainer_pkey PRIMARY KEY (id),
  CONSTRAINT country_person_fk FOREIGN KEY (country)
      REFERENCES country (id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  CONSTRAINT county_person_fk FOREIGN KEY (country, county)
      REFERENCES county (country, code) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);

--
Best Regards,
Tarlika Elisabeth Schmitz
On Mon, 23 May 2011 13:11:24 +1200
Andrej <andrej.groups@gmail.com> wrote:

>On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
><postgresql3@numerixtechnology.de> wrote:
>> On Sun, 22 May 2011 21:05:26 +0100
>> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>>
>>>A column contains location information, which may contain any of the
>>>following:
>>>
>>>1) null
>>>2) country name (e.g. "France")
>>>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>>
>> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.
>
>sanitising that data will be tedious - particularly with the
>variations on region.

Indeed. However, the situation is not quite as bleak as it appears:
- I am only dealing with 50 countries (Luxemburg and Vatican are not amongst them)
- Only for two countries will city/region be displayed instead of country.
- Ultimately, the only important bit of information is the country.
- The only really important persons are those from the two countries.

>Another thing of great import is whether the city can occur in the
>data column all by itself; if yes, it's next to impossible to
>distinguish it from a country.

Unfortunately this is the case.

>Specially if we assume that
>typos/misspelling are feasible on top of punctuation ...

and former countries like Czechoslovakia ...

>If I had a task like that to perform I'd dump the data out to file
>and have a good go at it w/ sed & awk, or perl, depending on
>how complex & voluminous the data is.

I had a quick look at the data (maybe not the most efficient SQL):

SELECT id, name,
  CASE WHEN location is null then null
       WHEN location !~ '.*,.*' then
         (select id from country where name = location)
       ELSE
         (select country from county
           where name = regexp_replace(location, '.*, (Rg\\.? )?(.*)', '\\2'))
  END AS country,
  location
FROM temp_person

Of 17000 historical records, 4400 don't match this simple pattern. Of the 4400, 1300 are "USA" or "Usa" instead of "United States", and 900 are "North America" - whatever that is! There are plenty of common + valid region abbreviations.

I get about 1000 new records of this type per year. I presume that more recent data are more accurate.

I know I won't be able to clean them all up. However, the import process has to be as automatic as possible, in such a way that inconsistencies are flagged up for later manual intervention. I say later because, for instance, a person's data will have to be imported with or without location info, because other new data will link to it.

--
Best Regards,
Tarlika Elisabeth Schmitz
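To see which raw values account for most of those 4400 misses, a frequency listing along the lines of the query above (an untested sketch; the lower() comparison is an extra guess at absorbing case-only variants) makes it easy to pick off the big offenders such as "USA" first:

SELECT location, count(*) AS hits
FROM temp_person
WHERE location IS NOT NULL
  AND location !~ '.*,.*'
  AND NOT EXISTS (SELECT 1 FROM country WHERE lower(name) = lower(location))
GROUP BY location
ORDER BY hits DESC
LIMIT 50;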
On 24/05/2011 6:39 AM, Tarlika Elisabeth Schmitz wrote:
> Indeed. However, the situation is not quite as bleak as it appears:
> - I am only dealing with 50 countries (Luxemburg and Vatican are not
> amongst them)
> - Only for two countries will city/region be displayed instead of
> country.
> - Ultimately, the only important bit of information is the country.
> - The only really important persons are those from the two countries.

Ah, see, that's critical. You've just been able to restrict the problem domain to a much simpler task with a smaller and well-defined range of possibilities. Most of the complexity is in the nasty corner cases and weirdness, and you've (probably) just cut most of that away.

> Of 17000 historical records, 4400 don't match this simple pattern.
> Of the 4400, 1300 are "USA" or "Usa" instead of "United States", and 900
> are "North America" - whatever that is! There are plenty of common +
> valid region abbreviations.
>
> I get about 1000 new records of this type per year.

I'd do this kind of analysis in a PL/Perl or PL/Python function myself. It's easier to write "if <x> then <y> else <z>" logic in a readable form, and such chained tests are usually better for this sort of work. That also makes it easier to do a cleanup pass first, where you substitute common spelling errors and canonicalize country names.

> However, the import process has to be as automatic as possible, in such
> a way that inconsistencies are flagged up for later manual
> intervention. I say later because, for instance, a person's data will
> have to be imported with or without location info, because other new
> data will link to it.

That's another good reason to use a PL function for this cleanup work. It's easy to INSERT a record into a side table that flags it for later examination if necessary, and to RAISE NOTICE or to issue a NOTIFY if you need to do closer-to-realtime checking.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/
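Here is the flag-and-carry-on idea sketched in PL/pgSQL rather than PL/Perl or PL/Python (purely illustrative: the import_problem table, the resolve_country name and the single canonicalisation rule are made up for the example, not part of the schema discussed in this thread):

CREATE TABLE import_problem
(
  person_id integer,
  raw_value character varying(256),
  noted_at  timestamp DEFAULT now()
);

CREATE OR REPLACE FUNCTION resolve_country(p_id integer, p_location character varying)
RETURNS character varying AS $$
DECLARE
  v_clean   character varying(256);
  v_country character varying(3);
BEGIN
  IF p_location IS NULL THEN
    RETURN NULL;  -- nothing to resolve, nothing to flag
  END IF;

  -- cheap canonicalisation pass first (one example substitution only)
  v_clean := regexp_replace(trim(p_location), '^(USA|Usa)$', 'United States');

  SELECT id INTO v_country FROM country WHERE lower(name) = lower(v_clean);

  IF v_country IS NULL THEN
    -- the person row is still imported; the oddity is queued for a human
    INSERT INTO import_problem (person_id, raw_value) VALUES (p_id, p_location);
    RAISE NOTICE 'unresolved location for person %: %', p_id, p_location;
  END IF;

  RETURN v_country;  -- NULL when no match was found
END;
$$ LANGUAGE plpgsql;

The same body could also sit in a trigger function on the staging table, which is close to the temp-table-plus-trigger approach described later in the thread.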
Tarlika Elisabeth Schmitz wrote:
> Lew wrote:
>> That isn't a table structure, that's a freeform text structure. You
>> didn't state your question, Tarlika, but your database structure is
>> terrible. For example, "region" and "country" should be different
>> columns.
>
> I presume you are referring to my original post:
> CREATE TABLE person
> (
>   id integer NOT NULL,
>   "name" character varying(256) NOT NULL,
>   "location" character varying(256),
>   CONSTRAINT person_pkey PRIMARY KEY (id)
> );
>
> Sorry, this was just a TEMPORARY table I created for quick analysis of
> my CSV data (now renamed to temp_person).
>
> The target table is:
> CREATE TABLE person
> (
>   id integer NOT NULL,
>   "name" character varying(100) NOT NULL,
>   country character varying(3),
>   county character varying(3),
>   town character varying(50),
>   CONSTRAINT trainer_pkey PRIMARY KEY (id),
>   CONSTRAINT country_person_fk FOREIGN KEY (country)
>       REFERENCES country (id) MATCH SIMPLE
>       ON UPDATE CASCADE ON DELETE RESTRICT,
>   CONSTRAINT county_person_fk FOREIGN KEY (country, county)
>       REFERENCES county (country, code) MATCH SIMPLE
>       ON UPDATE NO ACTION ON DELETE NO ACTION
> );

Ah, yes, that makes much more sense. Temporary tables such as you describe can be very convenient and effective. Thanks for the clarification.

I think this problem is very widespread, namely how to get structured information out of freeform data. I've encountered it many times over the years, as have so many I know. I believe that human intervention will always be needed for this type of work, e.g., distinguishing place names that seem the same or correlating ones that seem distinct. I also don't know of any perfect approach. Perhaps the best one can find is a probabilistic promise that error will be less than some epsilon.

That said, if you have a robust process to correct errors as the user population discovers them, then you can approach perfection asymptotically. Sometimes the best solution to a technical problem is a good human process. From an engineering standpoint, user feedback is a vital element of homeostatic control.

Edward W. Rouse's suggestion of a reference table to resolve different forms of address or region identification would fit well with such a process.

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
On 05/24/2011 10:57 AM, Lew wrote:
> Tarlika Elisabeth Schmitz wrote:
>> Lew wrote:
>>> That isn't a table structure, that's a freeform text structure. You
>>> didn't state your question, Tarlika, but your database structure is
>>> terrible. For example, "region" and "country" should be different
>>> columns.
>
>> I presume you are referring to my original post:
>> CREATE TABLE person
>> (
>>   id integer NOT NULL,
>>   "name" character varying(256) NOT NULL,
>>   "location" character varying(256),
>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>> );
>>
>> Sorry, this was just a TEMPORARY table I created for quick analysis of
>> my CSV data (now renamed to temp_person).
>>
>> The target table is:
>> CREATE TABLE person
>> (
>>   id integer NOT NULL,
>>   "name" character varying(100) NOT NULL,
>>   country character varying(3),
>>   county character varying(3),
>>   town character varying(50),
>>   CONSTRAINT trainer_pkey PRIMARY KEY (id),
>>   CONSTRAINT country_person_fk FOREIGN KEY (country)
>>       REFERENCES country (id) MATCH SIMPLE
>>       ON UPDATE CASCADE ON DELETE RESTRICT,
>>   CONSTRAINT county_person_fk FOREIGN KEY (country, county)
>>       REFERENCES county (country, code) MATCH SIMPLE
>>       ON UPDATE NO ACTION ON DELETE NO ACTION
>> );
>
> Ah, yes, that makes much more sense. Temporary tables such as you
> describe can be very convenient and effective. Thanks for the
> clarification.
>
> I think this problem is very widespread, namely how to get structured
> information out of freeform data. I've encountered it many times over
> the years, as have so many I know. I believe that human intervention
> will always be needed for this type of work, e.g., distinguishing place
> names that seem the same or correlating ones that seem distinct. I also
> don't know of any perfect approach. Perhaps the best one can find is a
> probabilistic promise that error will be less than some epsilon.
>
> That said, if you have a robust process to correct errors as the user
> population discovers them, then you can approach perfection
> asymptotically. Sometimes the best solution to a technical problem is a
> good human process. From an engineering standpoint, user feedback is a
> vital element of homeostatic control.
>
> Edward W. Rouse's suggestion of a reference table to resolve different
> forms of address or region identification would fit well with such a
> process.

To minimize the ultimately quite necessary human adjudication, one might make good use of what is often termed "crowd sourcing": keep all the distinct "hand entered" values and a map to the final human assessment. At least repeated oddities won't repeatedly involve hand editing.
On Tue, 24 May 2011 12:57:57 -0400
Lew <noone@lewscanon.com> wrote:

>Tarlika Elisabeth Schmitz wrote:
>>this was just a TEMPORARY table I created for quick analysis
>> of my CSV data (now renamed to temp_person).

>Ah, yes, that makes much more sense. Temporary tables such as you
>describe can be very convenient and effective.

The ER model is pretty simple. It has only 30 tables, of which 12 are "active" tables; the rest are pretty static (e.g. COUNTRY).

There are 7-8 CSV formats, and my idea is to import, clean up, normalize and distribute the data via temp tables and associated triggers. The whole process has to be pretty much automated as the volume is too big to babysit the import.

As I said in my other post, with the daily import I can't reject data just because part of it cannot [yet] be tidied, because other data will relate to it. Therefore my intention is to partially import (using the above example: import a PERSON without resolving location) and flag up cases for later manual intervention in a log table.

Some data just can't be cleaned up and I have to take a pragmatic approach - for instance, non-existing countries: I see there is an ISO standard for split countries such as Czechoslovakia (http://en.wikipedia.org/wiki/ISO_3166-3). That gets 12 PERSONs off my list. But what on earth do I do with "North America"? Create a new country because some info is better than none? I can hardly make them US citizens, because I know how upset people get when mistaken for their neighbours.

>I think this problem is very widespread, namely how to get structured
>information out of freeform data.

The PERSON.location is peanuts compared to other data of 20x the volume, where the really important information is encoded in a freeform string with all sorts of abbreviations used.

>That said, if you have a robust process to correct errors as the user
>population discovers them, then you can approach perfection
>asymptotically.[...]

1 user - moi

> From an engineering standpoint, user feedback is a vital element of
>homeostatic control.

I'll be having a lot of conversations with myself. ;-)

--
Best Regards,
Tarlika Elisabeth Schmitz
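If it is acceptable to carry withdrawn codes in the COUNTRY table, one option (an assumption about how one might model it, not something the ISO list mandates) is simply to add the former alpha-3 code:

-- CSK is the withdrawn ISO 3166-1 alpha-3 code for Czechoslovakia; the
-- four-letter ISO 3166-3 code (CSHH) would not fit the 3-character id.
-- "North America" has no code at all, so it still lands in the log table.
INSERT INTO country (id, name) VALUES ('CSK', 'Czechoslovakia');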
On Wed, 25 May 2011 09:25:48 -0600
Rob Sargent <robjsargent@gmail.com> wrote:

>On 05/24/2011 10:57 AM, Lew wrote:
>> Tarlika Elisabeth Schmitz wrote:
>>
>>> CREATE TABLE person
>>> (
>>>   id integer NOT NULL,
>>>   "name" character varying(256) NOT NULL,
>>>   "location" character varying(256),
>>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>>> );
>>>
>>> this was just a TEMPORARY table I created for quick analysis
>>> of my CSV data (now renamed to temp_person).

CREATE TABLE country
(
  id character varying(3) NOT NULL, -- alpha-3 code
  "name" character varying(50) NOT NULL,
  CONSTRAINT country_pkey PRIMARY KEY (id)
);

>To minimize the ultimately quite necessary human adjudication, one
>might make good use of what is often termed "crowd sourcing": Keep
>all the distinct "hand entered" values and a map to the final human
>assessment.

I was wondering how to do just that. I don't think it would be a good idea to hard code this into the clean-up script. Take, for instance, variations of COUNTRY.NAME spelling. Where would I store these?

I could do with a concept for this problem, which applies to a lot of string-type info.

--
Best Regards,
Tarlika Elisabeth Schmitz
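One possibility (the table and the example rows are made up for illustration, and 'USA' is assumed to be the alpha-3 id already present in COUNTRY) is a small alias table keyed back to COUNTRY, so the clean-up only ever joins against data and never hard-codes spellings:

CREATE TABLE country_alias
(
  alias   character varying(50) NOT NULL,
  country character varying(3) NOT NULL,
  CONSTRAINT country_alias_pkey PRIMARY KEY (alias),
  CONSTRAINT country_alias_country_fk FOREIGN KEY (country)
      REFERENCES country (id) ON UPDATE CASCADE ON DELETE RESTRICT
);

-- seeded by hand, or from the log of unresolved values
INSERT INTO country_alias VALUES ('Usa', 'USA');
INSERT INTO country_alias VALUES ('United States of America', 'USA');

The clean-up can then try COUNTRY.NAME first and COUNTRY_ALIAS second; storing lower(alias) and comparing on lower() would also absorb case-only variants.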
On 05/25/2011 03:13 PM, Tarlika Elisabeth Schmitz wrote:
> On Wed, 25 May 2011 09:25:48 -0600
> Rob Sargent <robjsargent@gmail.com> wrote:
>
>> On 05/24/2011 10:57 AM, Lew wrote:
>>> Tarlika Elisabeth Schmitz wrote:
>>>
>>>> CREATE TABLE person
>>>> (
>>>>   id integer NOT NULL,
>>>>   "name" character varying(256) NOT NULL,
>>>>   "location" character varying(256),
>>>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>>>> );
>>>>
>>>> this was just a TEMPORARY table I created for quick analysis
>>>> of my CSV data (now renamed to temp_person).
>
> CREATE TABLE country
> (
>   id character varying(3) NOT NULL, -- alpha-3 code
>   "name" character varying(50) NOT NULL,
>   CONSTRAINT country_pkey PRIMARY KEY (id)
> );
>
>> To minimize the ultimately quite necessary human adjudication, one
>> might make good use of what is often termed "crowd sourcing": Keep
>> all the distinct "hand entered" values and a map to the final human
>> assessment.
>
> I was wondering how to do just that. I don't think it would be a good
> idea to hard code this into the clean-up script. Take, for instance,
> variations of COUNTRY.NAME spelling. Where would I store these?
>
> I could do with a concept for this problem, which applies to a lot of
> string-type info.

I think you keep your current structures used for deducing the canonical forms, but with each unique input encountered you add it to your seen-thus-far list, which becomes just one more check (possibly the first such check).

create table address_input
(
  id serial primary key,            -- unique id from a sequence
  human_input character varying(256),
  resolution character varying(256)
);

You may have to add a column for the type of input (if you know, for instance, that the input is a street address v. a country), or you may want the resolution to be portioned into county, city and so on.
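Consulting that map during import might look roughly like this (a sketch assuming the address_input table above; a known input reuses the earlier human decision, an unseen one is queued exactly once):

-- known input: reuse the earlier human decision
SELECT resolution
FROM address_input
WHERE human_input = 'Frankfurt, Rg Hessen';

-- unseen input: record it once, resolve it by hand later
INSERT INTO address_input (human_input)
SELECT 'North America'
WHERE NOT EXISTS
      (SELECT 1 FROM address_input WHERE human_input = 'North America');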
On 26 May 2011 09:13, Tarlika Elisabeth Schmitz
<postgresql3@numerixtechnology.de> wrote:
> On Wed, 25 May 2011 09:25:48 -0600
> Rob Sargent <robjsargent@gmail.com> wrote:
>
>>On 05/24/2011 10:57 AM, Lew wrote:
>>> Tarlika Elisabeth Schmitz wrote:
>>>
>>>> CREATE TABLE person
>>>> (
>>>>   id integer NOT NULL,
>>>>   "name" character varying(256) NOT NULL,
>>>>   "location" character varying(256),
>>>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>>>> );
>>>>
>>>> this was just a TEMPORARY table I created for quick analysis
>>>> of my CSV data (now renamed to temp_person).
>
> CREATE TABLE country
> (
>   id character varying(3) NOT NULL, -- alpha-3 code
>   "name" character varying(50) NOT NULL,
>   CONSTRAINT country_pkey PRIMARY KEY (id)
> );
>
>>To minimize the ultimately quite necessary human adjudication, one
>>might make good use of what is often termed "crowd sourcing": Keep
>>all the distinct "hand entered" values and a map to the final human
>>assessment.
>
> I was wondering how to do just that. I don't think it would be a good
> idea to hard code this into the clean-up script. Take, for instance,
> variations of COUNTRY.NAME spelling. Where would I store these?
>
> I could do with a concept for this problem, which applies to a lot of
> string-type info.

I'd start w/ downloading a list as mentioned here:
http://answers.google.com/answers/threadview?id=596822

And run it through a wee perl script using
http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm
to make phonetic matches ...

Then I'd run your own data through DoubleMetaphone, and clean up matches if not too many false positives show up.

Cheers,
Andrej
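The same phonetic matching is also available inside the database via the contrib fuzzystrmatch module, which provides soundex(), metaphone() and dmetaphone() (on 8.4 it is installed from the contrib scripts rather than CREATE EXTENSION). A rough sketch, with the usual caveat that every match still needs a human glance for false positives:

-- assumes the fuzzystrmatch contrib module has been installed
SELECT t.location, c.id AS country_id, c.name
FROM temp_person t
JOIN country c ON dmetaphone(t.location) = dmetaphone(c.name)
WHERE t.location IS NOT NULL
  AND t.location !~ ',';   -- single-token values only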
On Thu, 26 May 2011 10:15:50 +1200
Andrej <andrej.groups@gmail.com> wrote:

>On 26 May 2011 09:13, Tarlika Elisabeth Schmitz
><postgresql3@numerixtechnology.de> wrote:
>> On Wed, 25 May 2011 09:25:48 -0600
>> Rob Sargent <robjsargent@gmail.com> wrote:
>>
>>>On 05/24/2011 10:57 AM, Lew wrote:
>>>> Tarlika Elisabeth Schmitz wrote:
>>>>
>>>>> CREATE TABLE person
>>>>> (
>>>>>   id integer NOT NULL,
>>>>>   "name" character varying(256) NOT NULL,
>>>>>   "location" character varying(256),
>>>>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>>>>> );
>>>>>
>>>>> this was just a TEMPORARY table I created for quick analysis
>>>>> of my CSV data (now renamed to temp_person).
>>
>> CREATE TABLE country
>> (
>>   id character varying(3) NOT NULL, -- alpha-3 code
>>   "name" character varying(50) NOT NULL,
>>   CONSTRAINT country_pkey PRIMARY KEY (id)
>> );
>>
>>>To minimize the ultimately quite necessary human adjudication, one
>>>might make good use of what is often termed "crowd sourcing": Keep
>>>all the distinct "hand entered" values and a map to the final human
>>>assessment.
>>[...]
>> I could do with a concept for this problem, which applies to a lot of
>> string-type info.
>
>I'd start w/ downloading a list as mentioned here:
>http://answers.google.com/answers/threadview?id=596822
>
>And run it through a wee perl script using
>http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm
>to make phonetic matches ...
>
>Then I'd run your own data through DoubleMetaphone, and clean up
>matches if not too many false positives show up.

Many thanks for all your suggestions. It will take me a while to work my way through these as I have several open ends.

In a similar vein, the PERSON names come as:
1) <firstname> <surname>
2) <initials> <surname> (more common)
3) <title> <initials>|<firstname> <surname>

Where I have a firstname and/or title, I'd be quite keen to determine sex, as it would be interesting to distinguish from a statistics point of view. I am basically just interested in people from two countries, names mainly English.

--
Best Regards,
Tarlika Elisabeth Schmitz
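A first stab at pulling those pieces apart (a sketch only: the list of titles is a guess, the surname is naively taken to be the last space-separated token, and the backslash doubling again assumes the 8.4 default string settings):

SELECT name,
       -- leading title, if any (the set of titles here is illustrative)
       substring(name from '^(Mr|Mrs|Miss|Ms|Dr|Sir|Lady)\\.? ') AS title,
       -- last space-separated token; multi-word surnames would need more care
       substring(name from ' ([^ ]+)$') AS surname
FROM temp_person;

A firstname (as opposed to bare initials) could then be checked against a list of known given names to guess the sex, with the unmatched remainder flagged for manual review just like the locations.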