Re: extracting location info from string - Mailing list pgsql-sql

From Tarlika Elisabeth Schmitz
Subject Re: extracting location info from string
Date
Msg-id 20110525191404.30a85cee@dick.coachhouse
Whole thread Raw
In response to Re: extracting location info from string  (Lew <noone@lewscanon.com>)
List pgsql-sql
On Tue, 24 May 2011 12:57:57 -0400
Lew <noone@lewscanon.com> wrote:

>Tarlika Elisabeth Schmitz wrote:
>>this was just a TEMPORARY table I created for quick analysis
>> of my CSV data (now renamed to temp_person).
>Ah, yes, that makes much more sense.  Temporary tables such as you
>describe can be very convenient and effective. 

The ER model is pretty simple. It has only 30 tables, of which 12 are
"active" tables; the rest are pretty static (e.g. COUNTRY).
There are 7-8 CVS formats and my idea is to import, clean up, normalize
and distribute the data via temp tables and associated triggers.

The whole process has to be pretty much automated as volume is too big
to babysit import. As I said in my other post, with the daily import, I
can't reject data just because part of the data cannot [yet] be tidied
because other data will relate to these.

Therefore my intention is to partially import (using the above example
import a PERSON without resolving location) and flag up cases for later
manual intervention in a log table.

Some data just can't be cleaned up and I have to take a pragmatic
approach - for instance, non-existing countries: I see there is an ISO
standard for split countries such as Czechoslovakia
(http://en.wikipedia.org/wiki/ISO_3166-3). That gets 12 PERSONs off my
list. But what on earth do I do with "North America". Create a new
country because some info is better than none? I can hardly make them
US citizens because I know how upset people get when mistaken for
their neighbours.

>I think this problem is very widespread, namely how to get structured 
>information out of freeform data. 

The PERSON.location is peanuts compared to other data of 20x the volume
and the really important information encoded in a freeform string with
all sorts of abbreviations used.

>That said, if you have a robust process to correct errors as the user 
>population discovers them, then you can approach perfection
>asymptotically.[...]

1 user - moi

> From an engineering standpoint, user feedback is a vital element of 
>homeostatic control.

I'll be having a lot of conversations with myself. ;-)


-- 

Best Regards,
Tarlika Elisabeth Schmitz


pgsql-sql by date:

Previous
From: Jasmin Dizdarevic
Date:
Subject: Re: Performance of NOT IN and <> with PG 9.0.4
Next
From: Tarlika Elisabeth Schmitz
Date:
Subject: Re: extracting location info from string