Thread: extracting location info from string
A column contains location information, which may contain any of the following:

1) null
2) country name (e.g. "France")
3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")

Using the tables below, I would like to derive COUNTRY.ID, REGION.CODE, city name.

=====
CREATE TABLE person
(
  id integer NOT NULL,
  "name" character varying(256) NOT NULL,
  "location" character varying(256),
  CONSTRAINT person_pkey PRIMARY KEY (id)
);

CREATE TABLE country
(
  id character varying(3) NOT NULL, -- alpha-3 code
  "name" character varying(50) NOT NULL,
  CONSTRAINT country_pkey PRIMARY KEY (id)
);

CREATE TABLE region
(
  country character varying(3) NOT NULL,
  code character varying(3) NOT NULL,
  "name" character varying(50) NOT NULL,
  CONSTRAINT region_pkey PRIMARY KEY (country, code),
  CONSTRAINT country_region_fk FOREIGN KEY (country)
      REFERENCES country (id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT
);
=====

System: PostgreSQL 8.4

--
Best Regards,
Tarlika Elisabeth Schmitz
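A first cut at splitting the comma-separated variants might look like the following (a sketch only: the handling of the optional "Rg."/"Rg " prefix is inferred from the examples above, and the doubled backslashes assume the PostgreSQL 8.4 default string-literal settings used elsewhere in this thread):

SELECT p.id,
       p.location,
       -- text before the first comma, if any (NULL for the plain-country case)
       trim(substring(p.location from '^([^,]+),')) AS city,
       -- text after the comma with any "Rg."/"Rg " prefix stripped;
       -- falls through unchanged for values without a comma (e.g. "France")
       trim(regexp_replace(p.location, '^[^,]+,\\s*(Rg\\.?\\s+)?', '')) AS region_or_country
FROM   person p
WHERE  p.location IS NOT NULL;

The region_or_country value can then be looked up against REGION.NAME or COUNTRY.NAME respectively.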
On Sun, 22 May 2011 21:05:26 +0100
Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:

>A column contains location information, which may contain any of the
>following:
>
>1) null
>2) country name (e.g. "France")
>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")

I also need to cope with variations of COUNTRY.NAME and REGION.NAME.
On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
<postgresql3@numerixtechnology.de> wrote:
> On Sun, 22 May 2011 21:05:26 +0100
> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>
>>A column contains location information, which may contain any of the
>>following:
>>
>>1) null
>>2) country name (e.g. "France")
>>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>
> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.

I'm not sure I fully understand your request, but sanitising that data will be tedious, whichever way you dice it - particularly with the variations on region.

Another thing of great import is whether the city can occur in the data column all by itself; if yes, it's next to impossible to distinguish it from a country. Especially if we assume that typos/misspellings are possible on top of the punctuation variations ... Of course, you could create a list of valid cities and countries, with homophones thrown in for good measure, and compare & replace things appropriately, with any name lacking a clean match being reported for human interaction =o)

If I had a task like that to perform, I'd dump the data out to a file and have a good go at it with sed & awk, or perl, depending on how complex & voluminous the data is.

Cheers,
Andrej

--
Please don't top post, and don't use HTML e-Mail :}  Make your quotes concise.

http://www.georgedillon.com/web/html_email_is_evil.shtml
On 23/05/2011 9:11 AM, Andrej wrote:
> On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
> <postgresql3@numerixtechnology.de> wrote:
>> On Sun, 22 May 2011 21:05:26 +0100
>> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>>
>>> A column contains location information, which may contain any of the
>>> following:
>>>
>>> 1) null
>>> 2) country name (e.g. "France")
>>> 3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>> 4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>> 5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>>
>> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.

This is a hard problem. You're dealing with free-form data that might be easily understood by humans, but relies on various contextual information and knowledge that makes it really hard for computers to understand.

If you want to do a good job of this, your best bet is to plug in 3rd party address analysis software that is dedicated to this task. Most (all?) such packages are commercial, proprietary affairs. They exist because it's really, really hard to do this right.

> Another thing of great import is whether the city can occur in the
> data column all by itself; if yes, it's next to impossible to distinguish
> it from a country.

Not least because some places are both, eg:

Luxembourg
The Vatican
Singapore

(The Grand Duchy of Luxembourg has other cities, but still serves as an example).

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/
On 05/22/2011 09:42 PM, Craig Ringer wrote:
> On 23/05/2011 9:11 AM, Andrej wrote:
>> On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
>> <postgresql3@numerixtechnology.de> wrote:
>>> On Sun, 22 May 2011 21:05:26 +0100
>>> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>>>
>>>> A column contains location information, which may contain any of the
>>>> following:
>>>>
>>>> 1) null
>>>> 2) country name (e.g. "France")
>>>> 3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>>> 4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>>> 5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>>>
>>> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.

That isn't a table structure, that's a freeform text structure. You didn't state your question, Tarlika, but your database structure is terrible. For example, "region" and "country" should be different columns. Really!

How you get your raw data into those columns can be interesting.

> This is a hard problem. You're dealing with free-form data that might be
> easily understood by humans, but relies on various contextual information and
> knowledge that makes it really hard for computers to understand.
>
> If you want to do a good job of this, your best bet is to plug in 3rd party
> address analysis software that is dedicated to this task. Most (all?) such

These aren't really addresses, as the OP presents them.

> packages are commercial, proprietary affairs. They exist because it's really,
> really hard to do this right.
>
>> Another thing of great import is whether the city can occur in the
>> data column all by itself; if yes, it's next to impossible to distinguish
>> it from a country.
>
> Not least because some places are both, eg:
>
> Luxembourg
> The Vatican
> Singapore
>
> (The Grand Duchy of Luxembourg has other cities, but still serves as an example).

And, of course, you have to distinguish the City of London from London. New York City comprises five boroughs (counties), each of which is itself a city. (Brooklyn is one of the largest cities in the world all by itself.)

"Region" has different meanings in different areas - it can mean part of a county, or state / province, or nation, or continent. "The Baltic region", "the Northeast", "upstate", "the North Country", "Europe" are all regions.

The OP should share more about the semantics of their problem domain and whether they really intend those table structures to be table structures. Really?

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
I would think that changing the location column to hold a FK to a location table, and setting up the location table with various columns for city, region, country and whatever else might be required, would be the way to go. It reduces column bloat on the main table, provides reuse of location data and is easier to modify in the future.

Edward W. Rouse
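A rough sketch of that layout (the table and column names here are illustrative, not taken from the schema posted earlier):

CREATE TABLE location
(
  id      serial PRIMARY KEY,
  town    character varying(50),
  region  character varying(50),
  country character varying(3) REFERENCES country (id)
);

-- person would then carry a reference instead of the raw string:
-- ALTER TABLE person ADD COLUMN location_id integer REFERENCES location (id);

Whether this pays off depends on how much location data is actually shared between persons.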
On Mon, 23 May 2011 00:25:16 -0400
Lew <noone@lewscanon.com> wrote:

>That isn't a table structure, that's a freeform text structure. You
>didn't state your question, Tarlika, but your database structure is
>terrible. For example, "region" and "country" should be different
>columns.

I presume you are referring to my original post:

CREATE TABLE person
(
  id integer NOT NULL,
  "name" character varying(256) NOT NULL,
  "location" character varying(256),
  CONSTRAINT person_pkey PRIMARY KEY (id)
);

Sorry, this was just a TEMPORARY table I created for quick analysis of my CSV data (now renamed to temp_person).

The target table is:

CREATE TABLE person
(
  id integer NOT NULL,
  "name" character varying(100) NOT NULL,
  country character varying(3),
  county character varying(3),
  town character varying(50),
  CONSTRAINT trainer_pkey PRIMARY KEY (id),
  CONSTRAINT country_person_fk FOREIGN KEY (country)
      REFERENCES country (id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  CONSTRAINT county_person_fk FOREIGN KEY (country, county)
      REFERENCES county (country, code) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);

--
Best Regards,
Tarlika Elisabeth Schmitz
On Mon, 23 May 2011 13:11:24 +1200
Andrej <andrej.groups@gmail.com> wrote:

>On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
><postgresql3@numerixtechnology.de> wrote:
>> On Sun, 22 May 2011 21:05:26 +0100
>> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>>
>>>A column contains location information, which may contain any of the
>>>following:
>>>
>>>1) null
>>>2) country name (e.g. "France")
>>>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>>
>> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.
>
>sanitising that data will be tedious - particularly with the
>variations on region.

Indeed. However, the situation is not quite as bleak as it appears:
- I am only dealing with 50 countries (Luxemburg and Vatican are not amongst them)
- Only for two countries will city/region be displayed instead of country.
- Ultimately, the only important bit of information is the country.
- The only really important persons are those from the two countries.

>Another thing of great import is whether the city can occur in the
>data column all by itself; if yes, it's next to impossible to
>distinguish it from a country.

Unfortunately this is the case.

>Specially if we assume that
>typos/misspelling are feasible on top of punctuation ...

and former countries like Czechoslovakia ...

>If I had a task like that to perform I'd dump the data out to file
>and have a good go at it w/ sed & awk, or perl, depending on
>how complex & voluminous the data is.

I had a quick look at the data (maybe not the most efficient SQL):

SELECT id, name,
  CASE WHEN location is null then null
       WHEN location !~ '.*,.*' then
         (select id from country where name = location)
       ELSE
         (select country from county
           where name = regexp_replace(location, '.*, (Rg\\.? )?(.*)', '\\2'))
  END AS country,
  location
FROM temp_person

Of 17000 historical records, 4400 don't match this simple pattern. Of the 4400, 1300 are "USA" or "Usa" instead of "United States", and 900 are "North America" - whatever that is! There are plenty of common + valid region abbreviations.

I get about 1000 new records of this type per year. I presume that more recent data are more accurate.

I know I won't be able to clean them all up. However, the import process has to be as automatic as possible, in such a way that inconsistencies are flagged up for later manual intervention. I say later because, for instance, a person's data will have to be imported with or without location info, because other new data will link to it.

--
Best Regards,
Tarlika Elisabeth Schmitz
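To see which raw values account for most of those 4400 misses, a frequency listing along the lines of the query above (an untested sketch; the lower() comparison is an extra guess at absorbing case-only variants) makes it easy to pick off the big offenders such as "USA" first:

SELECT location, count(*) AS hits
FROM temp_person
WHERE location IS NOT NULL
  AND location !~ '.*,.*'
  AND NOT EXISTS (SELECT 1 FROM country WHERE lower(name) = lower(location))
GROUP BY location
ORDER BY hits DESC
LIMIT 50;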
On 24/05/2011 6:39 AM, Tarlika Elisabeth Schmitz wrote:
> Indeed. However, the situation is not quite as bleak as it appears:
> - I am only dealing with 50 countries (Luxemburg and Vatican are not
> amongst them)
> - Only for two countries will city/region be displayed instead of
> country.
> - Ultimately, the only important bit of information is the country.
> - The only really important persons are those from the two countries.

Ah, see, that's critical. You've just been able to restrict the problem domain to a much simpler task with a smaller and well-defined range of possibilities. Most of the complexity is in the nasty corner cases and weirdness, and you've (probably) just cut most of that away.

> Of 17000 historical records, 4400 don't match this simple pattern.
> Of the 4400, 1300 are "USA" or "Usa" instead of "United States", and 900
> are "North America" - whatever that is! There are plenty of common +
> valid region abbreviations.
>
> I get about 1000 new records of this type per year.

I'd do this kind of analysis in a PL/Perl or PL/Python function myself. It's easier to write "if <x> then <y> else <z>" logic in a readable form, and such chained tests are usually better for this sort of work. That also makes it easier to do a cleanup pass first, where you substitute common spelling errors and canonicalize country names.

> However, the import process has to be as automatic as possible, in such
> a way that inconsistencies are flagged up for later manual
> intervention. I say later because, for instance, a person's data will
> have to be imported with or without location info, because other new
> data will link to it.

That's another good reason to use a PL function for this cleanup work. It's easy to INSERT a record into a side table that flags it for later examination if necessary, and to RAISE NOTICE or to issue a NOTIFY if you need to do closer-to-realtime checking.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/
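Here is the flag-and-carry-on idea sketched in PL/pgSQL rather than PL/Perl or PL/Python (purely illustrative: the import_problem table, the resolve_country name and the single canonicalisation rule are made up for the example, not part of the schema discussed in this thread):

CREATE TABLE import_problem
(
  person_id integer,
  raw_value character varying(256),
  noted_at  timestamp DEFAULT now()
);

CREATE OR REPLACE FUNCTION resolve_country(p_id integer, p_location character varying)
RETURNS character varying AS $$
DECLARE
  v_clean   character varying(256);
  v_country character varying(3);
BEGIN
  IF p_location IS NULL THEN
    RETURN NULL;  -- nothing to resolve, nothing to flag
  END IF;

  -- cheap canonicalisation pass first (one example substitution only)
  v_clean := regexp_replace(trim(p_location), '^(USA|Usa)$', 'United States');

  SELECT id INTO v_country FROM country WHERE lower(name) = lower(v_clean);

  IF v_country IS NULL THEN
    -- the person row is still imported; the oddity is queued for a human
    INSERT INTO import_problem (person_id, raw_value) VALUES (p_id, p_location);
    RAISE NOTICE 'unresolved location for person %: %', p_id, p_location;
  END IF;

  RETURN v_country;  -- NULL when no match was found
END;
$$ LANGUAGE plpgsql;

The same body could also sit in a trigger function on the staging table, which is close to the temp-table-plus-trigger approach described later in the thread.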
Tarlika Elisabeth Schmitz wrote:
> Lew wrote:
>> That isn't a table structure, that's a freeform text structure. You
>> didn't state your question, Tarlika, but your database structure is
>> terrible. For example, "region" and "country" should be different
>> columns.
>
> I presume you are referring to my original post:
> CREATE TABLE person
> (
>   id integer NOT NULL,
>   "name" character varying(256) NOT NULL,
>   "location" character varying(256),
>   CONSTRAINT person_pkey PRIMARY KEY (id)
> );
>
> Sorry, this was just a TEMPORARY table I created for quick analysis of
> my CSV data (now renamed to temp_person).
>
> The target table is:
> CREATE TABLE person
> (
>   id integer NOT NULL,
>   "name" character varying(100) NOT NULL,
>   country character varying(3),
>   county character varying(3),
>   town character varying(50),
>   CONSTRAINT trainer_pkey PRIMARY KEY (id),
>   CONSTRAINT country_person_fk FOREIGN KEY (country)
>       REFERENCES country (id) MATCH SIMPLE
>       ON UPDATE CASCADE ON DELETE RESTRICT,
>   CONSTRAINT county_person_fk FOREIGN KEY (country, county)
>       REFERENCES county (country, code) MATCH SIMPLE
>       ON UPDATE NO ACTION ON DELETE NO ACTION
> );

Ah, yes, that makes much more sense. Temporary tables such as you describe can be very convenient and effective. Thanks for the clarification.

I think this problem is very widespread, namely how to get structured information out of freeform data. I've encountered it many times over the years, as have so many I know. I believe that human intervention will always be needed for this type of work, e.g., distinguishing place names that seem the same or correlating ones that seem distinct. I also don't know of any perfect approach. Perhaps the best one can find is a probabilistic promise that error will be less than some epsilon.

That said, if you have a robust process to correct errors as the user population discovers them, then you can approach perfection asymptotically. Sometimes the best solution to a technical problem is a good human process. From an engineering standpoint, user feedback is a vital element of homeostatic control.

Edward W. Rouse's suggestion of a reference table to resolve different forms of address or region identification would fit well with such a process.

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
On 05/24/2011 10:57 AM, Lew wrote:
> Tarlika Elisabeth Schmitz wrote:
>> Lew wrote:
>>> That isn't a table structure, that's a freeform text structure. You
>>> didn't state your question, Tarlika, but your database structure is
>>> terrible. For example, "region" and "country" should be different
>>> columns.
>
>> I presume you are referring to my original post:
>> CREATE TABLE person
>> (
>>   id integer NOT NULL,
>>   "name" character varying(256) NOT NULL,
>>   "location" character varying(256),
>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>> );
>>
>> Sorry, this was just a TEMPORARY table I created for quick analysis of
>> my CSV data (now renamed to temp_person).
>>
>> The target table is:
>> CREATE TABLE person
>> (
>>   id integer NOT NULL,
>>   "name" character varying(100) NOT NULL,
>>   country character varying(3),
>>   county character varying(3),
>>   town character varying(50),
>>   CONSTRAINT trainer_pkey PRIMARY KEY (id),
>>   CONSTRAINT country_person_fk FOREIGN KEY (country)
>>       REFERENCES country (id) MATCH SIMPLE
>>       ON UPDATE CASCADE ON DELETE RESTRICT,
>>   CONSTRAINT county_person_fk FOREIGN KEY (country, county)
>>       REFERENCES county (country, code) MATCH SIMPLE
>>       ON UPDATE NO ACTION ON DELETE NO ACTION
>> );
>
> Ah, yes, that makes much more sense. Temporary tables such as you
> describe can be very convenient and effective. Thanks for the
> clarification.
>
> I think this problem is very widespread, namely how to get structured
> information out of freeform data. I've encountered it many times over
> the years, as have so many I know. I believe that human intervention
> will always be needed for this type of work, e.g., distinguishing place
> names that seem the same or correlating ones that seem distinct. I also
> don't know of any perfect approach. Perhaps the best one can find is a
> probabilistic promise that error will be less than some epsilon.
>
> That said, if you have a robust process to correct errors as the user
> population discovers them, then you can approach perfection
> asymptotically. Sometimes the best solution to a technical problem is a
> good human process. From an engineering standpoint, user feedback is a
> vital element of homeostatic control.
>
> Edward W. Rouse's suggestion of a reference table to resolve different
> forms of address or region identification would fit well with such a
> process.

To minimize the ultimately quite necessary human adjudication, one might make good use of what is often termed "crowd sourcing": keep all the distinct "hand entered" values and a map to the final human assessment. At least repeated oddities won't repeatedly involve hand editing.
On Tue, 24 May 2011 12:57:57 -0400
Lew <noone@lewscanon.com> wrote:

>Tarlika Elisabeth Schmitz wrote:
>>this was just a TEMPORARY table I created for quick analysis
>> of my CSV data (now renamed to temp_person).

>Ah, yes, that makes much more sense. Temporary tables such as you
>describe can be very convenient and effective.

The ER model is pretty simple. It has only 30 tables, of which 12 are "active" tables; the rest are pretty static (e.g. COUNTRY).

There are 7-8 CSV formats, and my idea is to import, clean up, normalize and distribute the data via temp tables and associated triggers. The whole process has to be pretty much automated as the volume is too big to babysit the import.

As I said in my other post, with the daily import I can't reject data just because part of it cannot [yet] be tidied, because other data will relate to it. Therefore my intention is to partially import (using the above example: import a PERSON without resolving location) and flag up cases for later manual intervention in a log table.

Some data just can't be cleaned up and I have to take a pragmatic approach - for instance, non-existing countries: I see there is an ISO standard for split countries such as Czechoslovakia (http://en.wikipedia.org/wiki/ISO_3166-3). That gets 12 PERSONs off my list. But what on earth do I do with "North America"? Create a new country because some info is better than none? I can hardly make them US citizens, because I know how upset people get when mistaken for their neighbours.

>I think this problem is very widespread, namely how to get structured
>information out of freeform data.

The PERSON.location is peanuts compared to other data of 20x the volume, where the really important information is encoded in a freeform string with all sorts of abbreviations used.

>That said, if you have a robust process to correct errors as the user
>population discovers them, then you can approach perfection
>asymptotically.[...]

1 user - moi

> From an engineering standpoint, user feedback is a vital element of
>homeostatic control.

I'll be having a lot of conversations with myself. ;-)

--
Best Regards,
Tarlika Elisabeth Schmitz
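If it is acceptable to carry withdrawn codes in the COUNTRY table, one option (an assumption about how one might model it, not something the ISO list mandates) is simply to add the former alpha-3 code:

-- CSK is the withdrawn ISO 3166-1 alpha-3 code for Czechoslovakia; the
-- four-letter ISO 3166-3 code (CSHH) would not fit the 3-character id.
-- "North America" has no code at all, so it still lands in the log table.
INSERT INTO country (id, name) VALUES ('CSK', 'Czechoslovakia');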
On Wed, 25 May 2011 09:25:48 -0600
Rob Sargent <robjsargent@gmail.com> wrote:

>On 05/24/2011 10:57 AM, Lew wrote:
>> Tarlika Elisabeth Schmitz wrote:
>>
>>> CREATE TABLE person
>>> (
>>>   id integer NOT NULL,
>>>   "name" character varying(256) NOT NULL,
>>>   "location" character varying(256),
>>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>>> );
>>>
>>> this was just a TEMPORARY table I created for quick analysis
>>> of my CSV data (now renamed to temp_person).

CREATE TABLE country
(
  id character varying(3) NOT NULL, -- alpha-3 code
  "name" character varying(50) NOT NULL,
  CONSTRAINT country_pkey PRIMARY KEY (id)
);

>To minimize the ultimately quite necessary human adjudication, one
>might make good use of what is often termed "crowd sourcing": Keep
>all the distinct "hand entered" values and a map to the final human
>assessment.

I was wondering how to do just that. I don't think it would be a good idea to hard code this into the clean-up script. Take, for instance, variations of COUNTRY.NAME spelling. Where would I store these?

I could do with a concept for this problem, which applies to a lot of string-type info.

--
Best Regards,
Tarlika Elisabeth Schmitz
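One possibility (the table and the example rows are made up for illustration, and 'USA' is assumed to be the alpha-3 id already present in COUNTRY) is a small alias table keyed back to COUNTRY, so the clean-up only ever joins against data and never hard-codes spellings:

CREATE TABLE country_alias
(
  alias   character varying(50) NOT NULL,
  country character varying(3) NOT NULL,
  CONSTRAINT country_alias_pkey PRIMARY KEY (alias),
  CONSTRAINT country_alias_country_fk FOREIGN KEY (country)
      REFERENCES country (id) ON UPDATE CASCADE ON DELETE RESTRICT
);

-- seeded by hand, or from the log of unresolved values
INSERT INTO country_alias VALUES ('Usa', 'USA');
INSERT INTO country_alias VALUES ('United States of America', 'USA');

The clean-up can then try COUNTRY.NAME first and COUNTRY_ALIAS second; storing lower(alias) and comparing on lower() would also absorb case-only variants.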
On 05/25/2011 03:13 PM, Tarlika Elisabeth Schmitz wrote:
> On Wed, 25 May 2011 09:25:48 -0600
> Rob Sargent <robjsargent@gmail.com> wrote:
>
>> On 05/24/2011 10:57 AM, Lew wrote:
>>> Tarlika Elisabeth Schmitz wrote:
>>>
>>>> CREATE TABLE person
>>>> (
>>>>   id integer NOT NULL,
>>>>   "name" character varying(256) NOT NULL,
>>>>   "location" character varying(256),
>>>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>>>> );
>>>>
>>>> this was just a TEMPORARY table I created for quick analysis
>>>> of my CSV data (now renamed to temp_person).
>
> CREATE TABLE country
> (
>   id character varying(3) NOT NULL, -- alpha-3 code
>   "name" character varying(50) NOT NULL,
>   CONSTRAINT country_pkey PRIMARY KEY (id)
> );
>
>> To minimize the ultimately quite necessary human adjudication, one
>> might make good use of what is often termed "crowd sourcing": Keep
>> all the distinct "hand entered" values and a map to the final human
>> assessment.
>
> I was wondering how to do just that. I don't think it would be a good
> idea to hard code this into the clean-up script. Take, for instance,
> variations of COUNTRY.NAME spelling. Where would I store these?
>
> I could do with a concept for this problem, which applies to a lot of
> string-type info.

I think you keep your current structures used for deducing the canonical forms, but with each unique input encountered you add it to your seen-thus-far list, which becomes just one more check (possibly the first such check).

create table address_input
(
  id serial primary key,            -- unique id from a sequence
  human_input character varying(256),
  resolution character varying(256)
);

You may have to add a column for the type of input (if you know, for instance, that the input is a street address v. a country), or you may want the resolution to be portioned into county, city and so on.
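Consulting that map during import might look roughly like this (a sketch assuming the address_input table above; a known input reuses the earlier human decision, an unseen one is queued exactly once):

-- known input: reuse the earlier human decision
SELECT resolution
FROM address_input
WHERE human_input = 'Frankfurt, Rg Hessen';

-- unseen input: record it once, resolve it by hand later
INSERT INTO address_input (human_input)
SELECT 'North America'
WHERE NOT EXISTS
      (SELECT 1 FROM address_input WHERE human_input = 'North America');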
On 26 May 2011 09:13, Tarlika Elisabeth Schmitz
<postgresql3@numerixtechnology.de> wrote:
> On Wed, 25 May 2011 09:25:48 -0600
> Rob Sargent <robjsargent@gmail.com> wrote:
>
>>On 05/24/2011 10:57 AM, Lew wrote:
>>> Tarlika Elisabeth Schmitz wrote:
>>>
>>>> CREATE TABLE person
>>>> (
>>>>   id integer NOT NULL,
>>>>   "name" character varying(256) NOT NULL,
>>>>   "location" character varying(256),
>>>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>>>> );
>>>>
>>>> this was just a TEMPORARY table I created for quick analysis
>>>> of my CSV data (now renamed to temp_person).
>
> CREATE TABLE country
> (
>   id character varying(3) NOT NULL, -- alpha-3 code
>   "name" character varying(50) NOT NULL,
>   CONSTRAINT country_pkey PRIMARY KEY (id)
> );
>
>>To minimize the ultimately quite necessary human adjudication, one
>>might make good use of what is often termed "crowd sourcing": Keep
>>all the distinct "hand entered" values and a map to the final human
>>assessment.
>
> I was wondering how to do just that. I don't think it would be a good
> idea to hard code this into the clean-up script. Take, for instance,
> variations of COUNTRY.NAME spelling. Where would I store these?
>
> I could do with a concept for this problem, which applies to a lot of
> string-type info.

I'd start w/ downloading a list as mentioned here:
http://answers.google.com/answers/threadview?id=596822

And run it through a wee perl script using
http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm
to make phonetic matches ...

Then I'd run your own data through DoubleMetaphone, and clean up matches if not too many false positives show up.

Cheers,
Andrej
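The same phonetic matching is also available inside the database via the contrib fuzzystrmatch module, which provides soundex(), metaphone() and dmetaphone() (on 8.4 it is installed from the contrib scripts rather than CREATE EXTENSION). A rough sketch, with the usual caveat that every match still needs a human glance for false positives:

-- assumes the fuzzystrmatch contrib module has been installed
SELECT t.location, c.id AS country_id, c.name
FROM temp_person t
JOIN country c ON dmetaphone(t.location) = dmetaphone(c.name)
WHERE t.location IS NOT NULL
  AND t.location !~ ',';   -- single-token values only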
On Thu, 26 May 2011 10:15:50 +1200
Andrej <andrej.groups@gmail.com> wrote:

>On 26 May 2011 09:13, Tarlika Elisabeth Schmitz
><postgresql3@numerixtechnology.de> wrote:
>> On Wed, 25 May 2011 09:25:48 -0600
>> Rob Sargent <robjsargent@gmail.com> wrote:
>>
>>>On 05/24/2011 10:57 AM, Lew wrote:
>>>> Tarlika Elisabeth Schmitz wrote:
>>>>
>>>>> CREATE TABLE person
>>>>> (
>>>>>   id integer NOT NULL,
>>>>>   "name" character varying(256) NOT NULL,
>>>>>   "location" character varying(256),
>>>>>   CONSTRAINT person_pkey PRIMARY KEY (id)
>>>>> );
>>>>>
>>>>> this was just a TEMPORARY table I created for quick analysis
>>>>> of my CSV data (now renamed to temp_person).
>>
>> CREATE TABLE country
>> (
>>   id character varying(3) NOT NULL, -- alpha-3 code
>>   "name" character varying(50) NOT NULL,
>>   CONSTRAINT country_pkey PRIMARY KEY (id)
>> );
>>
>>>To minimize the ultimately quite necessary human adjudication, one
>>>might make good use of what is often termed "crowd sourcing": Keep
>>>all the distinct "hand entered" values and a map to the final human
>>>assessment.
>>[...]
>> I could do with a concept for this problem, which applies to a lot of
>> string-type info.
>
>I'd start w/ downloading a list as mentioned here:
>http://answers.google.com/answers/threadview?id=596822
>
>And run it through a wee perl script using
>http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm
>to make phonetic matches ...
>
>Then I'd run your own data through DoubleMetaphone, and clean up
>matches if not too many false positives show up.

Many thanks for all your suggestions. It will take me a while to work my way through these as I have several open ends.

In a similar vein, the PERSON names come as:
1) <firstname> <surname>
2) <initials> <surname> (more common)
3) <title> <initials>|<firstname> <surname>

Where I have a firstname and/or title, I'd be quite keen to determine sex, as it would be interesting to distinguish from a statistics point of view. I am basically just interested in people from two countries, names mainly English.

--
Best Regards,
Tarlika Elisabeth Schmitz
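A first stab at pulling those pieces apart (a sketch only: the list of titles is a guess, the surname is naively taken to be the last space-separated token, and the backslash doubling again assumes the 8.4 default string settings):

SELECT name,
       -- leading title, if any (the set of titles here is illustrative)
       substring(name from '^(Mr|Mrs|Miss|Ms|Dr|Sir|Lady)\\.? ') AS title,
       -- last space-separated token; multi-word surnames would need more care
       substring(name from ' ([^ ]+)$') AS surname
FROM temp_person;

A firstname (as opposed to bare initials) could then be checked against a list of known given names to guess the sex, with the unmatched remainder flagged for manual review just like the locations.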