Home > mailing lists

Re: Removing duplicates - Mailing list pgsql-sql

From	Christof Glaser
Subject	Re: Removing duplicates
Date	February 26, 2002 12:55:05
Msg-id	20020226171944.7B8FF636A3@mail.gl.aser.de Whole thread Raw
In response to	Removing duplicates (Matthew Hagerty <matthew@brwholesale.com>)
List	pgsql-sql

Tree view

Matthew,

On Tuesday, 26. February 2002 16:10, Matthew Hagerty wrote:
> Greetings,
>
> I have a customer database (name, address1, address2, city, state, zip)
> and I need a query (or two) that will give me a mailing list with the
> least amount of duplicates possible.  I know that precise matching is
> not possible, i.e. "P.O. Box 123" will never match "PO Box 123" without
> some data massaging, but if I can isolate even 50% of any duplicates,
> that would help greatly.
>
> Also, any suggestions on which parameters to check the duplicates for? 
> My first thoughts were to make sure there were no two addresses the
> same in the same zip code.  Any insight (or examples) would be greatly
> appreciated.

I wrote a (Windows) C++ program some time ago to eliminate duplicate 
address records in a plain text file. Basically it works like this:

1. Choose the fields to check for duplicates. Let's call these primary  fields. I used Name, State, ZIP, City,
Address1,Address2   (in that order).

2. Choose the fields to decide, which entry should be kept, if duplicates  are found. These are secondary fields.
(phone,contact, email, ...)

3. Normalize the primary fields: remove any whitespace, punctuation,  special chars, make lowercase, expand "St." to
"street"etc.   That makes "P.O. Box 123" equal to "PO Box 123", as both are  normalized to "pobox123".

4. Sort on the normalized primary fields. Now duplicates are adjacent.
5. Iterate through the sorted data:  - Compare (normalized) primary fields with the next record  - If equal, we found a
duplicateand start another iteration:    . Check the secondary fields and decide, which record to keep.      (In my
case,those with a phone # would be kept.)    . Compare primary fields again with the next record, checking      for
moreduplicates

Translating the above into SQL is left as an exercise... :-) 

Hope that helps to get you started.

Christof.
--          gl.aser . software engineering . internet service      http://gl.aser.de/ . Planckstraße 7 . D-39104
Magdeburg

pgsql-sql by date:

From: Matthew Hagerty
Date: 26 February 2002, 12:46:46
Subject: Re: Removing duplicates

From: "Andy Marden"
Date: 26 February 2002, 12:56:04
Subject: Re: Join Statements

Re: Removing duplicates - Mailing list pgsql-sql

Previous

Next