Thread: Need magic for identifying duplicate addresses
Hi,

I need to clean up a lot of contact data because of a merge of customer lists that used to be kept separate. I already know that there are duplicate entries within the lists, and they overlap, too.

Relevant fields could be name, street, zip, city, phone.

Is there a way to do something like this with PostgreSQL? I fear this will still need a lot of manual sorting and searching even when potential matches get automatically identified.

Regards
Andreas
Andreas wrote:
> I need to clean up a lot of contact data because of a merge of customer
> lists that used to be kept separate.
> I already know that there are duplicate entries within the lists and they
> overlap, too.
>
> Relevant fields could be name, street, zip, city, phone
>
> Is there a way to do something like this with PostgreSQL?

Yes, yes there is. GROUP BY is your friend, and there are other friends too.

--
Darren Duncan
Andreas,

> Relevant fields could be name, street, zip, city, phone
> Is there a way to do something like this with PostgreSQL?
> I fear this will still need a lot of manual sorting and searching even
> when potential matches get automatically identified.

One of the techniques I use to increase the odds of detecting duplicates is to trim each column, remove all internal whitespace, coalesce everything into a single string, and calculate an MD5 hash of it (some other hash function may be better). It's not perfect (we are dealing with humans, after all), but it helps.

--
Gary Chambers

/* Nothing fancy and nothing Microsoft! */
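A sketch of that technique in Python (the field choice and normalisation details here are assumptions for illustration, not Gary's exact recipe):

```python
import hashlib

def dedup_key(*fields):
    """Build a crude duplicate-detection key: trim each field,
    strip all internal whitespace, lowercase, join, and hash."""
    normalized = "".join("".join(f.split()).lower() for f in fields if f)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

a = dedup_key("Miller's Bakery ", "Strawberry Str. 42", "12345")
b = dedup_key("Miller's  Bakery", "Strawberry   Str.42", " 12345")
print(a == b)  # True: whitespace and case differences collapse to one key
```

Grouping on such a key only catches records that are identical after normalisation; anything fuzzier needs the techniques discussed below in the thread.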
On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
> I need to clean up a lot of contact data because of a merge of customer
> lists that used to be kept separate.
> I already know that there are duplicate entries within the lists and they
> overlap, too.
>
> Relevant fields could be name, street, zip, city, phone

GROUP BY is your friend here; you basically want to normalise things as much as possible, then GROUP BY while counting duplicates, and wherever that count is greater than one you need to intervene somehow.

Humans are great at typos (in my experience about 2% of records contain a typo of some sort, almost all of which don't matter), so the first thing is to correct the typos (or use algorithms that are less susceptible to them) and to start getting things normalised. Free-form text is a bit of a fiddle to normalise, but the fuzzystrmatch[1] module in PG can help with this. An example of a query I run a lot is:

  SELECT soundex(city), array_agg(DISTINCT city)  -- array_accum for 8.3 and earlier
  FROM tbl
  GROUP BY 1
  HAVING COUNT(DISTINCT city) > 1
  ORDER BY 2 DESC;

Another common one is:

  SELECT t.*
  FROM tbl t, (
    SELECT soundex(city) AS city, soundex(name) AS name
    FROM tbl
    GROUP BY 1, 2
    HAVING COUNT(DISTINCT name) > 1) x
  WHERE soundex(t.city) = x.city
    AND soundex(t.name) = x.name;

I.e. find all the entries with similar-sounding cities and names where the names are spelled differently. You can then check through and correct the entries where they really should be the same.

What to do depends on how much data you have; with a few thousand records you can do lots of fiddling by hand, whereas with a few tens of millions of people you want to try to do more with code.

--
Sam  http://samason.me.uk/

[1] http://www.postgresql.org/docs/current/static/fuzzystrmatch.html
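For reference, the soundex() those queries group on is the classic Soundex algorithm; a rough pure-Python approximation (edge cases may differ from fuzzystrmatch's C implementation — this is just to show what the grouping key looks like):

```python
def soundex(word):
    """Classic American Soundex: keep the first letter, map the rest
    to digit classes, drop vowels and adjacent repeats, pad/cut to 4."""
    codes = {}
    for digit, letters in [("1", "BFPV"), ("2", "CGJKQSXZ"),
                           ("3", "DT"), ("4", "L"), ("5", "MN"), ("6", "R")]:
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    out = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":   # H and W don't separate duplicate codes
            prev = code
    return (out + "000")[:4]

print(soundex("Strawberry"), soundex("Strawberrystreet"))  # S361 S361
```

Because equal keys can be matched with plain `=`, PostgreSQL can hash or index them, which is what makes the GROUP BY approach scale.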
On 16.09.2010 13:18, Sam Mason wrote:
> On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
>> I need to clean up a lot of contact data because of a merge of customer
>> lists that used to be kept separate.
> What to do depends on how much data you have; a few thousand and you can
> do lots of fiddling by hand, whereas if you have a few tens of millions
> of people you want to try and do more with code.

Thanks Sam, I'll check out fuzzystrmatch.

We are talking about nearly 500,000 records with considerable overlap. It's not only typos to catch. There is variation in the way of writing things that aren't necessarily wrong, e.g.

  Miller's Bakery
  Bakery Miller
  Bakery Miller, Ltd.
  Bakery Miller and sons
  Bakery Smith (formerly Miller)

and the usual

  Strawberry Street
  Strawberrystreet
  Strawberry Str.42
  Strawberry Str. 42
  Strawberry Str. 42-45

Regards
Andreas
On Thu, Sep 16, 2010 at 03:22:15PM +0200, Andreas wrote:
> We are talking about nearly 500,000 records with considerable overlap.

Another thing to consider is whether each list contains only unique entries, and hence whether you can do a "best match" between the datasets--FULL OUTER JOIN is your friend here, but duplicates become a problem.

> It's not only typos to catch. There is variation in the way of writing
> things that aren't necessarily wrong, e.g.
> Miller's Bakery
> Bakery Miller
> Bakery Miller, Ltd.
> Bakery Miller and sons
> Bakery Smith (formerly Miller)

Soundex is tolerant of quite a lot of this, but word order is important.

When I've had to do this before (~360k addresses merging with ~80k) I went with normalised postcodes (in the UK, postcodes contain a nice mix of letters and numbers, so I can be reasonably sure about typos) and then went through a sizeable chunk by hand to make sure things were working "correctly".

Just a thought: depending on your spatial sparsity, you may be able to get away with trusting the zip code and checking only where the soundex of the name differs.

> and the usual
> Strawberry Street
> Strawberrystreet
> Strawberry Str.42
> Strawberry Str. 42
> Strawberry Str. 42-45

Soundex maps those all to the same code (and even '42-45 Strawberry Str'), so that's easy. In fact it completely ignores the numbers, so you'll have to do something specific about them.

--
Sam  http://samason.me.uk/
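One way to deal with the ignored numbers (a sketch under my own assumptions, not Sam's code: the function name and the abbreviation handling are invented) is to pull the digits out separately and compare (street-text, house-numbers) pairs:

```python
import re

def street_key(street):
    """Split a street field into a letters-only part (for soundex-style
    matching) and the house numbers that soundex would ignore."""
    numbers = re.findall(r"\d+", street)
    letters = re.sub(r"[\d\-.,]", " ", street)
    letters = " ".join(letters.split()).lower()
    return letters, numbers

print(street_key("Strawberry Str. 42"))     # ('strawberry str', ['42'])
print(street_key("Strawberry Str.42-45"))   # ('strawberry str', ['42', '45'])
```

Two records would then be match candidates when the text parts sound alike and the number lists intersect.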
> On 9/16/2010 5:18 AM, Sam Mason wrote:
>> On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
>>> I need to clean up a lot of contact data because of a merge of customer
>>> lists that used to be kept separate.
>>> I already know that there are duplicate entries within the lists and
>>> they overlap, too.
>>>
>>> Relevant fields could be name, street, zip, city, phone
>
> Since you're working with words, a better "fuzzy" match algorithm is
> Levenshtein distance. It's surprisingly good at coming up with an
> index number for words (or even phrases). Calculate the Levenshtein
> distance for two words, and the closer the indexes are to each other,
> the more alike the words are. It's well documented on Wikipedia (and
> other places).

The problem with using something like this is that it's awkward to do anything apart from taking the Cartesian product and comparing everything with everything else. If you've got 500k rows to start with, that's a *lot* of comparisons to do. The nice thing about using something like soundex is that it allows you to use equality when comparing, and hence PG has a chance to finish the query in a reasonable amount of time. If you're dealing with a small number of possibilities, I'd agree that some form of edit distance is a very useful tool.

> But overall, you're basically normalizing an address list with people's
> names. So I'd first go through the lists with an eye to getting rid of
> the junk (regexp is great for identifying names with "illegal"
> characters) and creating a starting name in uppercase. There's always at
> least one data entry clerk (usually several) who does quirky things like
> include notes in the last name field.

Yup, normalisation is very important when dealing with this sort of thing. Reducing free-form text down to something more constrained is important.
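For the curious: fuzzystrmatch also ships a levenshtein() function, and the classic dynamic-programming algorithm behind it can be sketched in a few lines of Python:

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions turning a into b (classic DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Strawberry Street", "Strawberrystreet"))  # 2
```

Note the cost point made above: every pair of rows needs its own O(len*len) computation, so over 500k rows this is roughly 1.25e11 pairwise comparisons, which is exactly why an equality-friendly key like soundex is preferable as a first cut.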
> You really need to parse each word, correct its spelling, and consider
> it in its local context, i.e. does it "fit" where it is? (as a geocoding
> routine would do with street names) There are canonical forms for names,
> and though these vary from country to country, they make good templates
> for a name parser.

I've tried doing something like this before, but tended to give up. I guess all the matching I've done has been small enough that it was quicker to handle the odd cases by hand than to spend more time writing general-purpose code that never seemed to be quite good enough.

--
Sam  http://samason.me.uk/
On Sep 15, 2010, at 10:40 PM, Andreas wrote:
> I need to clean up a lot of contact data because of a merge of customer
> lists that used to be kept separate.
> I already know that there are duplicate entries within the lists and they
> overlap, too.
>
> Relevant fields could be name, street, zip, city, phone
>
> Is there a way to do something like this with PostgreSQL?
>
> I fear this will still need a lot of manual sorting and searching even
> when potential matches get automatically identified.

I recently started working with the pg_trgm contrib module for matching songs based on titles and writers. This is especially difficult because the writer credits end up in one big field with every possible variation of order and naming conventions. So far I have been pleased with the results. For example, the algorithm correctly matched these two song titles:

  FONTAINE DI ROMA AKA FOUNTAINS OF ROME
  FOUNTAINS OF ROME A/K/A FONTANE DI ROMA

Trigrams can be indexed, so it is relatively fast to find an initial set of candidates. There is a nice introductory article here: http://www.postgresonline.com/journal/categories/59-pgtrgm

John DeSoi, Ph.D.
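pg_trgm's similarity() boils down to comparing sets of three-character substrings; a rough Python approximation of the idea (the real module's padding and edge cases may differ slightly):

```python
def trigrams(text):
    """Roughly pg_trgm's trigram set: lowercase, pad each word with
    two leading and one trailing space, take all 3-grams."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    """Shared trigrams divided by total trigrams (Jaccard measure)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

s = similarity("FONTAINE DI ROMA AKA FOUNTAINS OF ROME",
               "FOUNTAINS OF ROME A/K/A FONTANE DI ROMA")
print(round(s, 2))  # well above zero despite the reordered words
```

Because trigram sets ignore word order entirely, this handles the "Bakery Miller" vs. "Miller's Bakery" style of variation better than soundex on the whole string would.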
On Thu, 16 Sep 2010 06:22:15 -0700, Andreas <maps.on@gmx.net> wrote:
> It's not only typos to catch. There is variation in the way of writing
> things that aren't necessarily wrong, e.g.
> Miller's Bakery
> Bakery Miller
> Bakery Miller, Ltd.
> Bakery Miller and sons
> Bakery Smith (formerly Miller)
>
> and the usual
> Strawberry Street
> Strawberrystreet
> Strawberry Str.42
> Strawberry Str. 42
> Strawberry Str. 42-45

If this is a one-time procedure, I'd definitely go at it manually. The key is to quickly bind records and find the "remaining" ones. I'd create a lookup table and bind all similar values to a single canonical value.

I would also take each word in the field, turn it to lower case, remove punctuation, and enter it into another table (original_word varchar, normalized_word varchar). I would then search for the most popular normalized_word values, hoping that would throw back keywords like "strawberry" and "miller", and then search for those to continue building the lookup table.

You might want to write an interface that lets you drag all the DISTINCT keywords and drop them onto the "single" value. I have never seen one, though. :)

Good luck.
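The word-counting step might look something like this in Python (the function name and sample data are just for illustration):

```python
import collections
import string

def keyword_counts(values):
    """Lowercase each value, strip punctuation, split into words,
    and count them so the most popular keywords surface first."""
    counts = collections.Counter()
    strip = str.maketrans("", "", string.punctuation)
    for value in values:
        counts.update(value.lower().translate(strip).split())
    return counts

names = ["Miller's Bakery", "Bakery Miller", "Bakery Miller, Ltd.",
         "Bakery Miller and sons", "Bakery Smith (formerly Miller)"]
for word, n in keyword_counts(names).most_common(2):
    print(word, n)  # bakery 5, then miller 4
```

The top words then seed the lookup table, and each pass shrinks the set of records still unbound to a canonical value.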