Re: Need magic for identifieing double adresses - Mailing list pgsql-general

From Bill Thoen
Subject Re: Need magic for identifieing double adresses
Date
Msg-id 4C924C4B.2060305@gisnet.com
In response to Need magic for identifieing double adresses  (Andreas <maps.on@gmx.net>)
List pgsql-general
> On 9/16/2010 5:18 AM, Sam Mason wrote:
>
>> On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
>>
>>> I need to clean up a lot of contact data because of a merge of customer
>>> lists that used to be kept separate.
>>> I already know that there are double entries within the lists and they
>>> do overlap, too.
>>>
>>> Relevant fields could be  name, street, zip, city, phone
>
> Since you're working with words, a better "fuzzy" match algorithm is
> Levenshtein distance. It's surprisingly good at coming up with an
> index number for words (or even phrases). Calculate the Levenshtein
> distance for two words; the closer the indexes are to each other, the
> more alike the words are. It's well documented on Wikipedia (and other
> places).
>
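The edit distance described above can be sketched in a few lines of plain Python (the standard dynamic-programming formulation; within Postgres itself, the fuzzystrmatch contrib module ships a levenshtein() function that does the same thing):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner loop short
    prev = list(range(len(b) + 1))       # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

For example, levenshtein("meier", "meyer") is 1 (one substitution), while unrelated names score much higher, which is what makes the distance usable as a similarity threshold.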

The problem with using something like this is that it's awkward to do
anything apart from taking the Cartesian product and comparing everything
with everything else.  If you've got 500k rows to start with, that's a
*lot* of comparisons to be doing (500k squared is 250 billion pairs).
The nice thing about using something like soundex is that it allows you
to use equality when comparing, and hence PG has a chance to finish the
query in a reasonable amount of time.

If you're dealing with a small number of possibilities, I'd agree that
some form of edit distance is a very useful tool.
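The equality-based approach above can be sketched as bucketing names by their phonetic code, so only rows sharing a code ever get compared (a plain-Python sketch of standard American Soundex; Postgres's fuzzystrmatch module provides soundex() natively, so in SQL this becomes a join on equal codes):

```python
from collections import defaultdict

def soundex(name: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    out, prev = [name[0].upper()], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "hw":          # h and w do not separate doubled codes
            prev = code
    return ("".join(out) + "000")[:4]

# Bucket names by code; only rows within the same bucket need comparing,
# turning a cross product into an equality join.
buckets = defaultdict(list)
for n in ["Meier", "Meyer", "Smith", "Smythe", "Jones"]:
    buckets[soundex(n)].append(n)
```

Here "Meier" and "Meyer" land in the same bucket, as do "Smith" and "Smythe", while "Jones" stays separate; an edit-distance check can then be run within each (much smaller) bucket.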


> But overall, you're basically normalizing an address list with people's
> names. So I first go through the lists with an eye to getting rid of the
> junk (regexp is great for identifying names with "illegal" characters)
> and creating a starting name in uppercase. There's always at least one
> data entry clerk (usually several) who does quirky things like including
> notes in the last name field.
>

Yup, normalisation is very important when dealing with this sort of
thing.  Reducing free form text down to something more constrained is
the essential first step.
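One sketch of that kind of clean-up pass (the character whitelist here is an assumption; real data will need locale-specific handling for accented names):

```python
import re

# Assumed whitelist: ASCII letters, space, hyphen, apostrophe.
SUSPECT = re.compile(r"[^A-Za-z' \-]")

def looks_dirty(raw: str) -> bool:
    """Flag names containing characters outside the whitelist,
    e.g. a clerk's note stuffed into the last-name field."""
    return SUSPECT.search(raw) is not None

def starting_name(raw: str) -> str:
    """Uppercase and collapse whitespace to get a canonical starting form."""
    return re.sub(r"\s+", " ", raw.strip()).upper()
```

A name like "Smith (call after 5)" gets flagged for manual review, while clean entries pass straight through to the uppercased canonical form.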


> You really need to parse each word,
> correct its spelling, and consider it in the local context, i.e. does it
> "fit" where it is?  (as a geocoding routine would do with street names)
> There are canonical forms for names, and though these vary from country
> to country, they make good templates for a name parser.
>
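A toy version of such a template-driven parser (the template and the title/suffix lists are illustrative assumptions; real canonical forms are country-specific, as noted above):

```python
# Hypothetical template: [TITLE] GIVEN [MIDDLE...] SURNAME [SUFFIX]
TITLES = {"MR", "MRS", "MS", "DR", "PROF"}
SUFFIXES = {"JR", "SR", "II", "III", "IV"}

def parse_name(name: str) -> dict:
    """Assign each token a slot in the canonical name template."""
    tokens = name.upper().replace(".", "").split()
    parts = {"title": None, "given": None, "middle": [],
             "surname": None, "suffix": None}
    if tokens and tokens[0] in TITLES:
        parts["title"] = tokens.pop(0)
    if tokens and tokens[-1] in SUFFIXES:
        parts["suffix"] = tokens.pop()
    if tokens:
        parts["given"] = tokens.pop(0)
    if tokens:
        parts["surname"] = tokens.pop()
    parts["middle"] = tokens
    return parts
```

So "Dr. John Q. Smith Jr." parses into title/given/middle/surname/suffix slots; a token that fails to fit any slot is exactly the kind of "doesn't fit where it is" signal mentioned above.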

I've tried doing something like this before, but tended to give up.  I
guess all the matching I've been doing has been small enough that it was
quicker to handle the odd cases by hand than spending more time writing
general purpose code that never seemed to be quite good enough.

-- Sam http://samason.me.uk/

