Home > mailing lists

Re: Need magic for identifieing double adresses - Mailing list pgsql-general

From	Sam Mason
Subject	Re: Need magic for identifieing double adresses
Date	September 16, 2010 08:19:03
Msg-id	20100916111852.GB7862@samason.me.uk Whole thread Raw
In response to	Need magic for identifieing double adresses (Andreas <maps.on@gmx.net>)
Responses	Re: Need magic for identifieing double adresses
List	pgsql-general

Tree view

On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
> I need to clean up a lot of contact data because of a merge of customer
> lists that used to be kept separate.
> I allready know that there are double entries within the lists and they
> do overlap, too.
>
> Relevant fields could be  name, street, zip, city, phone

GROUP BY is your friend here; you basically want to normalise things as
much as possible and then GROUP BY counting number of duplicates and
where this count is greater than one you need to intervene somehow.

Humans are great at typos (2% of records seem to contain a typo of
some sort in my experience, with almost all of them not mattering)
so the first thing would be to correct the typos (or use algorithms
that are less susceptible to typos) and to start getting things
normalised.  Free form text is a bit of a fiddle to normalise, but the
fuzzystrmatch[1] module in PG can help with this.

An example of query that I do a lot of is:

  SELECT soundex(city), array_agg(DISTINCT city) -- array_accum for 8.3 and earlier
  FROM tbl
  GROUP BY 1
  HAVING COUNT(DISTINCT city) > 1
  ORDER BY 2 DESC;

Another common one is:

  SELECT t.*
  FROM tbl t, (
    SELECT soundex(city) AS cty, soundex(name) AS name
    FROM tbl
    GROUP BY 1, 2
    HAVING COUNT(DISTINCT name) > 1) x
  WHERE soundex(t.city) = x.city
    AND soundex(t.name) = x.name;

I.e. find all the entries with similar sounding cities and names where
they have spelled their names differently.  You can then check through
and correct the entries where they really should be the same.

What to do depends on how much data you have; a few thousand and you can
do lots of fiddling by hand, whereas if you have a few tens of millions
of people you want to try and do more with code.

--
  Sam  http://samason.me.uk/

 [1] http://www.postgresql.org/docs/current/static/fuzzystrmatch.html

pgsql-general by date:

From: Craig Ringer
Date: 16 September 2010, 06:22:52
Subject: Re: libssl issue ?

From: Gissur Þórhallsson
Date: 16 September 2010, 09:47:50
Subject: Re: value

Re: Need magic for identifieing double adresses - Mailing list pgsql-general

Previous

Next