Thread: Need magic for identifying duplicate addresses
Hi,

I need to clean up a lot of contact data because of a merge of customer lists that used to be kept separate. I already know that there are duplicate entries within the lists, and they overlap, too.

Relevant fields could be name, street, zip, city, phone.

Is there a way to do something like this with PostgreSQL? I fear this will still need a lot of manual sorting and searching even when potential matches get automatically identified.

Regards
Andreas
Andreas wrote:
> I need to clean up a lot of contact data because of a merge of customer
> lists that used to be kept separate.
> I already know that there are duplicate entries within the lists and they
> overlap, too.
>
> Relevant fields could be name, street, zip, city, phone
>
> Is there a way to do something like this with PostgreSQL?

Yes, yes there is. GROUP BY is your friend, and there are other friends too.

--
Darren Duncan
Andreas,

> Relevant fields could be name, street, zip, city, phone
> Is there a way to do something like this with PostgreSQL?
> I fear this will still need a lot of manual sorting and searching even
> when potential matches get automatically identified.

One of the techniques I use to increase the odds of detecting duplicates is to trim each column, remove all internal whitespace, coalesce everything into a single string, and calculate an MD5 hash of it (some other hash function may be better). It's not perfect (we are dealing with humans, after all), but it helps.

--
Gary Chambers

/* Nothing fancy and nothing Microsoft! */
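A sketch of that technique in Python (the field choice and normalisation details here are assumptions for illustration, not Gary's exact recipe):

```python
import hashlib

def dedup_key(*fields):
    """Build a crude duplicate-detection key: trim each field,
    strip all internal whitespace, lowercase, join, and hash."""
    normalized = "".join("".join(f.split()).lower() for f in fields if f)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

a = dedup_key("Miller's Bakery ", "Strawberry Str. 42", "12345")
b = dedup_key("Miller's  Bakery", "Strawberry   Str.42", " 12345")
print(a == b)  # True: whitespace and case differences collapse to one key
```

Grouping on such a key only catches records that are identical after normalisation; anything fuzzier needs the techniques discussed below in the thread.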
On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
> I need to clean up a lot of contact data because of a merge of customer
> lists that used to be kept separate.
> I already know that there are duplicate entries within the lists and they
> overlap, too.
>
> Relevant fields could be name, street, zip, city, phone

GROUP BY is your friend here; you basically want to normalise things as much as possible, then GROUP BY while counting duplicates, and wherever that count is greater than one you need to intervene somehow.

Humans are great at typos (in my experience about 2% of records contain a typo of some sort, almost all of which don't matter), so the first thing is to correct the typos (or use algorithms that are less susceptible to them) and to start getting things normalised. Free-form text is a bit of a fiddle to normalise, but the fuzzystrmatch[1] module in PG can help with this. An example of a query I run a lot is:

  SELECT soundex(city), array_agg(DISTINCT city)  -- array_accum for 8.3 and earlier
  FROM tbl
  GROUP BY 1
  HAVING COUNT(DISTINCT city) > 1
  ORDER BY 2 DESC;

Another common one is:

  SELECT t.*
  FROM tbl t, (
    SELECT soundex(city) AS city, soundex(name) AS name
    FROM tbl
    GROUP BY 1, 2
    HAVING COUNT(DISTINCT name) > 1) x
  WHERE soundex(t.city) = x.city
    AND soundex(t.name) = x.name;

I.e. find all the entries with similar-sounding cities and names where the names are spelled differently. You can then check through and correct the entries where they really should be the same.

What to do depends on how much data you have; with a few thousand records you can do lots of fiddling by hand, whereas with a few tens of millions of people you want to try to do more with code.

--
Sam  http://samason.me.uk/

[1] http://www.postgresql.org/docs/current/static/fuzzystrmatch.html
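For reference, the soundex() those queries group on is the classic Soundex algorithm; a rough pure-Python approximation (edge cases may differ from fuzzystrmatch's C implementation — this is just to show what the grouping key looks like):

```python
def soundex(word):
    """Classic American Soundex: keep the first letter, map the rest
    to digit classes, drop vowels and adjacent repeats, pad/cut to 4."""
    codes = {}
    for digit, letters in [("1", "BFPV"), ("2", "CGJKQSXZ"),
                           ("3", "DT"), ("4", "L"), ("5", "MN"), ("6", "R")]:
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    out = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":   # H and W don't separate duplicate codes
            prev = code
    return (out + "000")[:4]

print(soundex("Strawberry"), soundex("Strawberrystreet"))  # S361 S361
```

Because equal keys can be matched with plain `=`, PostgreSQL can hash or index them, which is what makes the GROUP BY approach scale.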
On 16.09.2010 13:18, Sam Mason wrote:
> On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
>> I need to clean up a lot of contact data because of a merge of customer
>> lists that used to be kept separate.
> What to do depends on how much data you have; a few thousand and you can
> do lots of fiddling by hand, whereas if you have a few tens of millions
> of people you want to try and do more with code.

Thanks Sam, I'll check out fuzzystrmatch.

We are talking about nearly 500,000 records with considerable overlap. It's not only typos to catch. There is variation in the way of writing things that aren't necessarily wrong, e.g.

  Miller's Bakery
  Bakery Miller
  Bakery Miller, Ltd.
  Bakery Miller and sons
  Bakery Smith (formerly Miller)

and the usual

  Strawberry Street
  Strawberrystreet
  Strawberry Str.42
  Strawberry Str. 42
  Strawberry Str. 42-45

Regards
Andreas
On Thu, Sep 16, 2010 at 03:22:15PM +0200, Andreas wrote:
> We are talking about nearly 500,000 records with considerable overlap.

Another thing to consider is whether each list contains only unique entries, and hence whether you can do a "best match" between the datasets--FULL OUTER JOIN is your friend here, but duplicates become a problem.

> It's not only typos to catch. There is variation in the way of writing
> things that aren't necessarily wrong, e.g.
> Miller's Bakery
> Bakery Miller
> Bakery Miller, Ltd.
> Bakery Miller and sons
> Bakery Smith (formerly Miller)

Soundex is tolerant of quite a lot of this, but word order is important.

When I've had to do this before (~360k addresses merging with ~80k) I went with normalised postcodes (in the UK, postcodes contain a nice mix of letters and numbers, so I can be reasonably sure about typos) and then went through a sizeable chunk by hand to make sure things were working "correctly".

Just a thought: depending on your spatial sparsity, you may be able to get away with trusting the zip code and checking only where the soundex of the name differs.

> and the usual
> Strawberry Street
> Strawberrystreet
> Strawberry Str.42
> Strawberry Str. 42
> Strawberry Str. 42-45

Soundex maps those all to the same code (and even '42-45 Strawberry Str'), so that's easy. In fact it completely ignores the numbers, so you'll have to do something specific about them.

--
Sam  http://samason.me.uk/
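One way to deal with the ignored numbers (a sketch under my own assumptions, not Sam's code: the function name and the abbreviation handling are invented) is to pull the digits out separately and compare (street-text, house-numbers) pairs:

```python
import re

def street_key(street):
    """Split a street field into a letters-only part (for soundex-style
    matching) and the house numbers that soundex would ignore."""
    numbers = re.findall(r"\d+", street)
    letters = re.sub(r"[\d\-.,]", " ", street)
    letters = " ".join(letters.split()).lower()
    return letters, numbers

print(street_key("Strawberry Str. 42"))     # ('strawberry str', ['42'])
print(street_key("Strawberry Str.42-45"))   # ('strawberry str', ['42', '45'])
```

Two records would then be match candidates when the text parts sound alike and the number lists intersect.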
> On 9/16/2010 5:18 AM, Sam Mason wrote:
>> On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
>>> I need to clean up a lot of contact data because of a merge of customer
>>> lists that used to be kept separate.
>>> I already know that there are duplicate entries within the lists and
>>> they overlap, too.
>>>
>>> Relevant fields could be name, street, zip, city, phone
>
> Since you're working with words, a better "fuzzy" match algorithm is
> Levenshtein distance. It's surprisingly good at coming up with an
> index number for words (or even phrases). Calculate the Levenshtein
> distance for two words, and the closer the indexes are to each other,
> the more alike the words are. It's well documented on Wikipedia (and
> other places).

The problem with using something like this is that it's awkward to do anything apart from taking the Cartesian product and comparing everything with everything else. If you've got 500k rows to start with, that's a *lot* of comparisons to do. The nice thing about using something like soundex is that it allows you to use equality when comparing, and hence PG has a chance to finish the query in a reasonable amount of time. If you're dealing with a small number of possibilities, I'd agree that some form of edit distance is a very useful tool.

> But overall, you're basically normalizing an address list with people's
> names. So I'd first go through the lists with an eye to getting rid of
> the junk (regexp is great for identifying names with "illegal"
> characters) and creating a starting name in uppercase. There's always at
> least one data entry clerk (usually several) who does quirky things like
> include notes in the last name field.

Yup, normalisation is very important when dealing with this sort of thing. Reducing free-form text down to something more constrained is important.
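For the curious: fuzzystrmatch also ships a levenshtein() function, and the classic dynamic-programming algorithm behind it can be sketched in a few lines of Python:

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions turning a into b (classic DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Strawberry Street", "Strawberrystreet"))  # 2
```

Note the cost point made above: every pair of rows needs its own O(len*len) computation, so over 500k rows this is roughly 1.25e11 pairwise comparisons, which is exactly why an equality-friendly key like soundex is preferable as a first cut.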
> You really need to parse each word, correct its spelling, and consider
> it in its local context, i.e. does it "fit" where it is? (as a geocoding
> routine would do with street names) There are canonical forms for names,
> and though these vary from country to country, they make good templates
> for a name parser.

I've tried doing something like this before, but tended to give up. I guess all the matching I've done has been small enough that it was quicker to handle the odd cases by hand than to spend more time writing general-purpose code that never seemed to be quite good enough.

--
Sam  http://samason.me.uk/
On Sep 15, 2010, at 10:40 PM, Andreas wrote:
> I need to clean up a lot of contact data because of a merge of customer
> lists that used to be kept separate.
> I already know that there are duplicate entries within the lists and they
> overlap, too.
>
> Relevant fields could be name, street, zip, city, phone
>
> Is there a way to do something like this with PostgreSQL?
>
> I fear this will still need a lot of manual sorting and searching even
> when potential matches get automatically identified.

I recently started working with the pg_trgm contrib module for matching songs based on titles and writers. This is especially difficult because the writer credits end up in one big field with every possible variation of order and naming conventions. So far I have been pleased with the results. For example, the algorithm correctly matched these two song titles:

  FONTAINE DI ROMA AKA FOUNTAINS OF ROME
  FOUNTAINS OF ROME A/K/A FONTANE DI ROMA

Trigrams can be indexed, so it is relatively fast to find an initial set of candidates. There is a nice introductory article here: http://www.postgresonline.com/journal/categories/59-pgtrgm

John DeSoi, Ph.D.
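pg_trgm's similarity() boils down to comparing sets of three-character substrings; a rough Python approximation of the idea (the real module's padding and edge cases may differ slightly):

```python
def trigrams(text):
    """Roughly pg_trgm's trigram set: lowercase, pad each word with
    two leading and one trailing space, take all 3-grams."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    """Shared trigrams divided by total trigrams (Jaccard measure)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

s = similarity("FONTAINE DI ROMA AKA FOUNTAINS OF ROME",
               "FOUNTAINS OF ROME A/K/A FONTANE DI ROMA")
print(round(s, 2))  # well above zero despite the reordered words
```

Because trigram sets ignore word order entirely, this handles the "Bakery Miller" vs. "Miller's Bakery" style of variation better than soundex on the whole string would.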
On Thu, 16 Sep 2010 06:22:15 -0700, Andreas <maps.on@gmx.net> wrote:
> It's not only typos to catch. There is variation in the way of writing
> things that aren't necessarily wrong, e.g.
> Miller's Bakery
> Bakery Miller
> Bakery Miller, Ltd.
> Bakery Miller and sons
> Bakery Smith (formerly Miller)
>
> and the usual
> Strawberry Street
> Strawberrystreet
> Strawberry Str.42
> Strawberry Str. 42
> Strawberry Str. 42-45

If this is a one-time procedure, I'd definitely go at it manually. The key is to quickly bind records and find the "remaining" ones. I'd create a lookup table and bind all similar values to a single canonical value.

I would also take each word in the field, turn it to lower case, remove punctuation, and enter it into another table (original_word varchar, normalized_word varchar). I would then search for the most popular normalized_word values, hoping that would throw back keywords like "strawberry" and "miller", and then search for those to continue building the lookup table.

You might want to write an interface that lets you drag all the DISTINCT keywords and drop them onto the "single" value. I have never seen one, though. :)

Good luck.
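The word-counting step might look something like this in Python (the function name and sample data are just for illustration):

```python
import collections
import string

def keyword_counts(values):
    """Lowercase each value, strip punctuation, split into words,
    and count them so the most popular keywords surface first."""
    counts = collections.Counter()
    strip = str.maketrans("", "", string.punctuation)
    for value in values:
        counts.update(value.lower().translate(strip).split())
    return counts

names = ["Miller's Bakery", "Bakery Miller", "Bakery Miller, Ltd.",
         "Bakery Miller and sons", "Bakery Smith (formerly Miller)"]
for word, n in keyword_counts(names).most_common(2):
    print(word, n)  # bakery 5, then miller 4
```

The top words then seed the lookup table, and each pass shrinks the set of records still unbound to a canonical value.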