Re: extracting location info from string - Mailing list pgsql-sql

From Craig Ringer
Subject Re: extracting location info from string
Date
Msg-id 4DDB02E9.9040509@postnewspapers.com.au
Whole thread Raw
In response to Re: extracting location info from string  (Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de>)
List pgsql-sql
On 24/05/2011 6:39 AM, Tarlika Elisabeth Schmitz wrote:

> Indeed. However, the situation is not quite as bleak as it appears:
> - I am only dealing with 50 countries (Luxemburg and Vatican are not
> amongst them)
> - Only for two countries will city/region be displayed instead of
> country.
> - Ultimately, where the only important bit of imformation is the
> country.
> - The only really important persons are those from the two countries.

Ah, see that's critical. You've just been able to restrict the problem 
domain to a much simpler task with a smaller and well-defined range of 
possibilities. Most of the complexity is in the nasty corner cases and 
weirdness, and you've (probably) just cut most of that away.

> Of 17000 historical records, 4400 don't match this simple pattern.
> Of the 4400, 1300 are "USA" or "Usa" instead of "United States", 900
> "North America" whatever that is! There are plenty of common +
> valid region abbreviations.
>
> I get about 1000 new records of this type per year.

I'd do this kind of analysis in a PL/Perl or PL/python function myself. 
It's easier to write "If <x> then <y> else <x>" logic in a readable 
form, and such chained tests are usually better for this sort of work. 
That also makes it easier to do a cleanup pass first, where you 
substitute common spelling errors and canonicalize country names.

> However, the import process has to be as automatic as possible in such
> a way that inconsistencies are flagged up for later manual
> intervention. I say later because, for instance, a person's data will
> have to be imported with or without location info because other new
> data will link to it.

That's another good reason to use a PL function for this cleanup work. 
It's easy to INSERT a record into a side table that flags it for later 
examination if necessary,  and to RAISE NOTICE or to issue a NOTIFY if 
you need to do closer-to-realtime checking.

-- 
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/


pgsql-sql by date:

Previous
From: Craig Ringer
Date:
Subject: Re: Which version of PostgreSQL should I use.
Next
From: Adrian Klaver
Date:
Subject: Re: problem with update data sets from front ends 8.4, Ubuntu 10.04