Re: Need magic for identifieing double adresses - Mailing list pgsql-general
| From | Bill Thoen |
|---|---|
| Subject | Re: Need magic for identifieing double adresses |
| Date | |
| Msg-id | 4C924C4B.2060305@gisnet.com |
| In response to | Need magic for identifieing double adresses (Andreas <maps.on@gmx.net>) |
| List | pgsql-general |
> On 9/16/2010 5:18 AM, Sam Mason wrote:
>> On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
>>> I need to clean up a lot of contact data because of a merge of customer
>>> lists that used to be kept separate.
>>> I already know that there are double entries within the lists and they
>>> do overlap, too.
>>>
>>> Relevant fields could be name, street, zip, city, phone
>
> Since you're working with words, a better "fuzzy" match algorithm is
> Levenshtein distance. It's surprisingly good at coming up with an
> index number for words (or even phrases). Calculate the Levenshtein
> distance for two words, and the closer the indexes are to each other,
> the more alike the words are. It's well documented on Wikipedia (and
> other places).

The problem with using something like this is that it's awkward to do anything apart from taking the Cartesian product and comparing everything with everything else. If you've got 500k rows to start with, that's a *lot* of comparisons to be doing. The nice thing about using something like Soundex is that it allows you to use equality when comparing, and hence PG has a chance to finish the query in a reasonable amount of time. If you're dealing with a small number of possibilities, I'd agree that some form of edit distance is a very useful tool.

> But overall, you're basically normalizing an address list with people's
> names. So I first go through the lists with an eye to getting rid of the
> junk (regexp is great for identifying names with "illegal" characters)
> and creating a starting name in uppercase. There's always at least one
> data entry clerk (usually several) who does quirky things like including
> notes in the last name field.

Yup, normalisation is very important when dealing with this sort of thing. Reducing free-form text down to something more constrained is important.
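The normalisation step described above — flagging name fields that contain "illegal" characters and reducing the rest to a canonical uppercase form — could be sketched roughly as follows. The character whitelist, helper names, and sample records here are invented for illustration; they are not from the thread, and real data would need accented letters and locale-specific rules.

```python
import re

# Rough whitelist for a (Western-style) name field; anything outside it
# is likely a data-entry note rather than part of the name.
VALID_NAME = re.compile(r"^[A-Za-z' .-]+$")

def canonical(name: str) -> str:
    """Collapse runs of whitespace and uppercase, giving a match key."""
    return re.sub(r"\s+", " ", name).strip().upper()

records = ["Thoen", "  de la  Cruz ", "Smith (see note #12)", "O'Brien"]

for raw in records:
    if VALID_NAME.match(raw):
        print(canonical(raw))               # e.g. "DE LA CRUZ"
    else:
        print(repr(raw), "-> needs manual review")
```

Rows that fail the whitelist go to a manual-review pile rather than being silently dropped, which matches the "data entry clerk puts notes in the last name field" problem described above.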
> You really need to parse each word, correct its spelling, and consider
> it in the local context, i.e. does it "fit" where it is? (as a geocoding
> routine would do with street names). There are canonical forms for
> names, and though these vary from country to country, they make good
> templates for a name parser.

I've tried doing something like this before, but tended to give up. I guess all the matching I've been doing has been small enough that it was quicker to handle the odd cases by hand than to spend more time writing general-purpose code that never seemed to be quite good enough.

--
Sam
http://samason.me.uk/
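To make the Soundex-versus-edit-distance trade-off discussed above concrete, here is a minimal sketch: pairwise Levenshtein over the whole list is O(n²), while grouping rows by a phonetic key first ("blocking") means the expensive comparison only runs within each group — which is the same idea as joining on equality of `soundex()` in PostgreSQL. The sample names are invented, and these are simplified textbook implementations, not the fuzzystrmatch code.

```python
from collections import defaultdict
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def soundex(word: str) -> str:
    """American Soundex: first letter plus three digit codes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    out, last = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            out += code
        if ch not in "HW":   # H and W do not break a run of equal codes
            last = code
    return (out + "000")[:4]

names = ["Smith", "Smyth", "Schmidt", "Jones", "Johns"]

# Blocking: only names sharing a Soundex key get the expensive comparison.
blocks = defaultdict(list)
for n in names:
    blocks[soundex(n)].append(n)

for key, group in blocks.items():
    for a, b in combinations(group, 2):
        print(key, a, b, levenshtein(a, b))
```

With 500k rows this turns one huge cross join into many small ones, at the cost of missing pairs whose phonetic keys differ — the recall/performance trade-off the thread is describing.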