Home > mailing lists

Re: extracting location info from string - Mailing list pgsql-sql

From	Rob Sargent
Subject	Re: extracting location info from string
Date	May 25, 2011 12:43:08
Msg-id	4DDD1F7C.5040008@gmail.com Whole thread Raw
In response to	Re: extracting location info from string (Lew <noone@lewscanon.com>)
Responses	Re: extracting location info from string
List	pgsql-sql

Tree view


On 05/24/2011 10:57 AM, Lew wrote:
> Tarlika Elisabeth Schmitz wrote:
>> Lew wrote:
>>> That isn't a table structure, that's a freeform text structure. You
>>> didn't state your question, Tarlika, but your database structure is
>>> terrible. For example, "region" and "country" should be different
>>> columns.
>
>> I presume you are referring to my original post:
>> CREATE TABLE person
>> (
>> id integer NOT NULL,
>> "name" character varying(256) NOT NULL,
>> "location" character varying(256),
>> CONSTRAINT person_pkey PRIMARY KEY (id)
>> );
>>
>> Sorry, this was just a TEMPORARY table I created for quick analysis of
>> my CSV data (now renamed to temp_person).
>>
>>
>>
>> The target table is:
>> CREATE TABLE person
>> (
>> id integer NOT NULL,
>> "name" character varying(100) NOT NULL,
>> country character varying(3),
>> county character varying(3),
>> town character varying(50),
>> CONSTRAINT trainer_pkey PRIMARY KEY (id),
>> CONSTRAINT country_person_fk FOREIGN KEY (country)
>> REFERENCES country (id) MATCH SIMPLE
>> ON UPDATE CASCADE ON DELETE RESTRICT,
>> CONSTRAINT county_person_fk FOREIGN KEY (country, county)
>> REFERENCES county (country, code) MATCH SIMPLE
>> ON UPDATE NO ACTION ON DELETE NO ACTION,
>> );
>
> Ah, yes, that makes much more sense. Temporary tables such as you
> describe can be very convenient and effective. Thanks for the
> clarification.
>
> I think this problem is very widespread, namely how to get structured
> information out of freeform data. I've encountered it many times over
> the years, as have so many I know. I believe that human intervention
> will always be needed for this type of work, e.g., distinguishing place
> names that seem the same or correlating ones that seem distinct. I also
> don't know of any perfect approach. Perhaps the best one can find is a
> probabilistic promise that error will be less than some epsilon.
>
> That said, if you have a robust process to correct errors as the user
> population discovers them, then you can approach perfection
> asymptotically. Sometimes the best solution to a technical problem is a
> good human process. From an engineering standpoint, user feedback is a
> vital element of homeostatic control.
>
> Edward W. Rouse's suggestion of a reference table to resolve different
> forms of address or region identification would fit well with such a
> process.
>
To minimize the ultimately quite necessary human adjudication, one might 
make good use of what is often termed "crowd sourcing":  Keep all the 
distinct "hand entered" values and a map to the final human assessment.  At least repeated oddities won't repeatedly
involvehand editing.

pgsql-sql by date:

From: Tom Lane
Date: 25 May 2011, 11:13:33
Subject: Re: Performance of NOT IN and <> with PG 9.0.4

From: Jasmin Dizdarevic
Date: 25 May 2011, 12:48:04
Subject: Re: Performance of NOT IN and <> with PG 9.0.4

Re: extracting location info from string - Mailing list pgsql-sql

Previous

Next