On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
> I need to clean up a lot of contact data because of a merge of customer
> lists that used to be kept separate.
> I already know that there are duplicate entries within the lists and
> that they overlap, too.
>
> Relevant fields could be name, street, zip, city, phone
GROUP BY is your friend here; you basically want to normalise things as
much as possible, then GROUP BY while counting the number of duplicates,
and wherever that count is greater than one you need to intervene somehow.
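For example, assuming the merged data sits in a table called tbl with
the columns you listed (tbl and the column names are just placeholders
here), the basic duplicate check looks like:
SELECT name, street, zip, city, phone, count(*)
FROM tbl
GROUP BY name, street, zip, city, phone
HAVING count(*) > 1;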
Humans are great at typos (in my experience about 2% of records contain
a typo of some sort, though almost all of them don't matter), so the
first thing would be to correct the typos (or use algorithms that are
less susceptible to them) and start getting things normalised. Free-form
text is a bit of a fiddle to normalise, but the fuzzystrmatch[1] module
in PG can help with this.
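What normalisation means depends entirely on your data, but a minimal
sketch (again assuming the placeholder column names from above) might be:
-- collapse runs of spaces, trim stray whitespace, and strip everything
-- but digits and '+' from phone numbers; adjust the rules to your data
UPDATE tbl SET
  name  = trim(regexp_replace(name, '  +', ' ', 'g')),
  city  = initcap(trim(city)),
  phone = regexp_replace(phone, '[^0-9+]', '', 'g');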
An example of a query I run a lot is:
SELECT soundex(city), array_agg(DISTINCT city) -- array_accum for 8.3 and earlier
FROM tbl
GROUP BY 1
HAVING COUNT(DISTINCT city) > 1
ORDER BY 2 DESC;
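That lists, for each soundex code, all the distinct spellings of the
city that share it, so variant spellings of the same place show up side
by side.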
Another common one is:
SELECT t.*
FROM tbl t, (
    SELECT soundex(city) AS city, soundex(name) AS name
    FROM tbl
    GROUP BY 1, 2
    HAVING COUNT(DISTINCT name) > 1) x
WHERE soundex(t.city) = x.city
  AND soundex(t.name) = x.name;
I.e. find all the entries whose city and name sound similar but where
the name has been spelled differently. You can then check through
and correct the entries that really should be the same.
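The correction itself can be as simple as settling on one canonical
spelling per group; a hedged sketch (the city name here is just an
example) would be:
-- apply one canonical spelling to all similar-sounding variants;
-- eyeball the group first to make sure they really are the same place
UPDATE tbl
SET city = 'Hamburg'
WHERE soundex(city) = soundex('Hamburg')
  AND city <> 'Hamburg';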
What to do depends on how much data you have; with a few thousand rows
you can do a lot of fiddling by hand, whereas with a few tens of millions
of people you want to do more of the work in code.
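Once things are normalised, removing the exact duplicates can also be
done mechanically; a minimal sketch, assuming tbl has a unique id column
(add one if it doesn't):
-- keep the row with the lowest id from each group of exact duplicates;
-- note that NULLs in these columns won't match with = and need extra care
DELETE FROM tbl a
USING tbl b
WHERE a.id > b.id
  AND a.name = b.name
  AND a.street = b.street
  AND a.zip = b.zip
  AND a.city = b.city
  AND a.phone = b.phone;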
--
Sam http://samason.me.uk/
[1] http://www.postgresql.org/docs/current/static/fuzzystrmatch.html