Thread: Removing duplicates

Removing duplicates

From
Matthew Hagerty
Date:
Greetings,

I have a customer database (name, address1, address2, city, state, zip) and 
I need a query (or two) that will give me a mailing list with the least 
amount of duplicates possible.  I know that precise matching is not 
possible, i.e. "P.O. Box 123" will never match "PO Box 123" without some 
data massaging, but if I can isolate even 50% of any duplicates, that would 
help greatly.

Also, any suggestions on which parameters to check the duplicates for?  My 
first thoughts were to make sure there were no two addresses the same in 
the same zip code.  Any insight (or examples) would be greatly appreciated.

Thank you,
Matthew



Re: Removing duplicates

From
Andrew Perrin
Date:
Zip code and then address seems reasonable to me. If you want to go
all-out, you could write a script that checks addresses against the USPS's
address system (www.usps.com/ncsc/lookups/lookup_zip+4.html) which will
return a standardized address. If you do that for your whole table you'll
end up being able to compare these standardized addresses against one
another, thereby even matching PO Box 123 vs. P.O. Box 123.
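
If you just want a quick first pass in SQL, something along these lines
should surface the candidate groups (a sketch only; it assumes the table
is named customer, matching the column list in the original post):

    -- list zip/address1 combinations that appear more than once
    SELECT zip, address1, count(*) AS copies
    FROM customer
    GROUP BY zip, address1
    HAVING count(*) > 1
    ORDER BY copies DESC;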

ap

----------------------------------------------------------------------
Andrew J Perrin - andrew_perrin@unc.edu - http://www.unc.edu/~aperrin
Assistant Professor of Sociology, U of North Carolina, Chapel Hill
269 Hamilton Hall, CB#3210, Chapel Hill, NC 27599-3210 USA
 


On Tue, 26 Feb 2002, Matthew Hagerty wrote:

> Greetings,
> 
> I have a customer database (name, address1, address2, city, state, zip) and 
> I need a query (or two) that will give me a mailing list with the least 
> amount of duplicates possible.  I know that precise matching is not 
> possible, i.e. "P.O. Box 123" will never match "PO Box 123" without some 
> data massaging, but if I can isolate even 50% of any duplicates, that would 
> help greatly.
> 
> Also, any suggestions on which parameters to check the duplicates for?  My 
> first thoughts were to make sure there were no two addresses the same in 
> the same zip code.  Any insight (or examples) would be greatly appreciated.
> 
> Thank you,
> Matthew
> 
> 



Re: Removing duplicates

From
Jeff Self
Date:
On Tue, 2002-02-26 at 10:10, Matthew Hagerty wrote:
> Greetings,
> 
> I have a customer database (name, address1, address2, city, state, zip) and 
> I need a query (or two) that will give me a mailing list with the least 
> amount of duplicates possible.  I know that precise matching is not 
> possible, i.e. "P.O. Box 123" will never match "PO Box 123" without some 
> data massaging, but if I can isolate even 50% of any duplicates, that would 
> help greatly.
It depends on which fields you want to check. If you are referring to
just addresses, you can do a query using DISTINCT on address1. This
will give you only one record where address1 is P.O. Box 123. Of
course, this doesn't help with those that are PO Box 123. You could
always dump your data and write some Perl scripts to alter the
addresses so that they become alike. You could also use the LIKE
keyword in your SQL statement:

SELECT * FROM customer WHERE address1 LIKE '%Box 123%';

After you run this query, you might want to modify the addresses so that
they are the same.
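
Once the variants have been made identical, DISTINCT ON is one way to
keep a single row per address (a rough sketch, using the same customer
table as the LIKE example above):

    SELECT DISTINCT ON (address1, zip) *
    FROM customer
    ORDER BY address1, zip;
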
> Also, any suggestions on which parameters to check the duplicates for?  My 
> first thoughts were to make sure there were no two addresses the same in 
> the same zip code.  Any insight (or examples) would be greatly appreciated.

These constraints should be in the table definitions themselves, which
would prevent these duplications from taking place.
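
For example, a unique index would keep exact duplicates from being
entered again (a sketch; it only blocks rows that match exactly, and
creating it will fail until the existing duplicates are removed):

    CREATE UNIQUE INDEX customer_zip_address1_key ON customer (zip, address1);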

-- 
Jeff Self
Information Technology Analyst
Department of Personnel
City of Newport News
2400 Washington Ave.
Newport News, VA 23607
757-926-6930



Re: Removing duplicates

From
"Dan MacNeil"
Date:
There is software (sadly not open source) that standardizes your
addresses. Once your addresses are standardized, you can check them
for duplicates.  There is also shrink-wrapped software to eliminate
duplicates.  It might be cheaper & quicker to buy this software...

You should probably poke around on www.usps.gov

Some links I came up with:

http://www.casscertification.net/cassfaq.html


http://www.usps.com/ncsc/addressservices/addressqualityservices/addresscorrection.htm

http://search.dmoz.org/cgi-bin/search?search=CASS+postal


----- Original Message -----
From: "Matthew Hagerty" <matthew@brwholesale.com>
To: <pgsql-sql@postgresql.org>
Sent: Tuesday, February 26, 2002 10:10 AM
Subject: [SQL] Removing duplicates


> Greetings,
>
> I have a customer database (name, address1, address2, city, state, zip) and
> I need a query (or two) that will give me a mailing list with the least
> amount of duplicates possible.  I know that precise matching is not
> possible, i.e. "P.O. Box 123" will never match "PO Box 123" without some
> data massaging, but if I can isolate even 50% of any duplicates, that would
> help greatly.
>
> Also, any suggestions on which parameters to check the duplicates for?  My
> first thoughts were to make sure there were no two addresses the same in
> the same zip code.  Any insight (or examples) would be greatly appreciated.
>
> Thank you,
> Matthew
>



Re: Removing duplicates

From
"Josh Berkus"
Date:
Matt,

> I have a customer database (name, address1, address2, city, state,
>  zip) and I need a query (or two) that will give me a mailing list
>  with the least amount of duplicates possible.  I know that precise
>  matching is not possible, i.e. "P.O. Box 123" will never match "PO
>  Box 123" without some data massaging, but if I can isolate even 50%
>  of any duplicates, that would help greatly.

From the sound of things, you are trying to get out a mailing with the
least number of duplicates you can in a limited time, rather than
trying to clean up the list for permanent storage.  Chances are, you
bought or traded this list from an outside source, yes?
 

In that case, here are some quick de-duplication tricks from my
fundraising days:

1. Compare text columns with the punctuation stripped out.  It's
amazing how many typographical differences come down to punctuation.
FYI, in Roberto Mello's function catalog (accessible from
http://techdocs.postgresql.org/ ) I believe that I posted some
punctuation-stripping PL/pgSQL procedures.  For further de-duplication,
compare only the left 15 characters of a text field
(e.g. SUBSTR(strip_string(address1), 1, 15)), but beware ... this can
cause you to weed out some non-duplicates, such as multiple residents
of large apartment buildings.
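
(As an illustrative sketch only, not the actual procedure from the
catalog, a strip-and-compare step could look roughly like this; it
assumes the table is named customer and has an id primary key:)

    -- hypothetical helper: lowercase and drop common punctuation/whitespace
    CREATE FUNCTION strip_string(text) RETURNS text AS '
        SELECT lower(translate($1, ''.,-#/ '', ''''));
    ' LANGUAGE 'sql';

    -- candidate duplicates: same zip, same first 15 normalized characters
    SELECT a.id, b.id
    FROM customer a, customer b
    WHERE a.id < b.id
      AND a.zip = b.zip
      AND substr(strip_string(a.address1), 1, 15)
        = substr(strip_string(b.address1), 1, 15);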
 

2. For box office lists, you can use phonetic formulas to compare
personal names (NOT addresses).  I believe that Soundex and Metaphone
are included in /contrib these days.  This does not work well on
Chinese or Southeast Asian names.
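
(Assuming the soundex() function from contrib is installed, a
name-level pass within a zip code might look like this sketch, again
with an assumed id column:)

    SELECT a.id, a.name, b.id, b.name
    FROM customer a, customer b
    WHERE a.id < b.id
      AND a.zip = b.zip
      AND soundex(a.name) = soundex(b.name);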
 

3. If you got phone numbers along with the addresses, these are an
excellent gauge of uniqueness.

> Also, any suggestions on which parameters to check the duplicates
>  for?  My first thoughts were to make sure there were no two
>  addresses the same in the same zip code.  Any insight (or examples)
>  would be greatly appreciated.

Just don't forget that some of the zip codes will probably be
erroneous.

-Josh


Re: Removing duplicates

From
Matthew Hagerty
Date:
At 08:39 AM 2/26/2002 -0800, Josh Berkus wrote:
>Matt,
>
> > I have a customer database (name, address1, address2, city, state,
> >  zip) and I need a query (or two) that will give me a mailing list
> >  with the least amount of duplicates possible.  I know that precise
> >  matching is not possible, i.e. "P.O. Box 123" will never match "PO
> >  Box 123" without some data massaging, but if I can isolate even 50%
> >  of any duplicates, that would help greatly.
>
> From the sound of things, you are trying to get out a mailing with the
>  least number of duplicates you can in a limited time, rather than
>  trying to clean up the list for permanent storage.  Chances are, you
>  bought or traded this list from an outside source, yes?

Actually the database is a collection of customers collected over the past 
8 years.  The sales people are "supposed" to try to find customers when 
they call in to place an order, but that does not always happen and 
customers have undoubtedly been added more than once.  Of course, over 8 
years and various computer systems, punctuation, case, spelling, different 
employees, etc. the data is less than perfect.  It is amazing how easily a 
user can enter a duplicate no matter how tricky or smart you think your 
code is! :-)

I will certainly look into the address standardization information that has 
been posted.  Thank you everyone for your input!

Matthew

<snip>



Re: Removing duplicates

From
Christof Glaser
Date:
Matthew,

On Tuesday, 26. February 2002 16:10, Matthew Hagerty wrote:
> Greetings,
>
> I have a customer database (name, address1, address2, city, state, zip)
> and I need a query (or two) that will give me a mailing list with the
> least amount of duplicates possible.  I know that precise matching is
> not possible, i.e. "P.O. Box 123" will never match "PO Box 123" without
> some data massaging, but if I can isolate even 50% of any duplicates,
> that would help greatly.
>
> Also, any suggestions on which parameters to check the duplicates for? 
> My first thoughts were to make sure there were no two addresses the
> same in the same zip code.  Any insight (or examples) would be greatly
> appreciated.

I wrote a (Windows) C++ program some time ago to eliminate duplicate 
address records in a plain text file. Basically it works like this:

1. Choose the fields to check for duplicates. Let's call these primary
   fields. I used Name, State, ZIP, City, Address1, Address2 (in that
   order).

2. Choose the fields that decide which entry should be kept if
   duplicates are found. These are secondary fields (phone, contact,
   email, ...).

3. Normalize the primary fields: remove any whitespace, punctuation and
   special chars, make lowercase, expand "St." to "street", etc. That
   makes "P.O. Box 123" equal to "PO Box 123", as both are normalized
   to "pobox123".

4. Sort on the normalized primary fields. Now duplicates are adjacent.

5. Iterate through the sorted data:
   - Compare the (normalized) primary fields with the next record.
   - If equal, we found a duplicate and start another iteration:
     . Check the secondary fields and decide which record to keep.
       (In my case, those with a phone # would be kept.)
     . Compare the primary fields again with the next record, checking
       for more duplicates.
 

Translating the above into SQL is left as an exercise... :-) 
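
That said, a rough first cut at steps 3-5 might look something like
this (a sketch only; it assumes a customer table with id and phone
columns, and it only strips punctuation and whitespace rather than
expanding abbreviations):

    SELECT DISTINCT ON (norm_name, norm_addr, zip) *
    FROM (SELECT *,
                 lower(translate(name, '. ,-', ''))     AS norm_name,
                 lower(translate(address1, '. ,-', '')) AS norm_addr
          FROM customer) c
    -- within each normalized group, records with a phone number sort first
    ORDER BY norm_name, norm_addr, zip, (phone IS NULL), phone;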

Hope that helps to get you started.

Christof.
--
gl.aser . software engineering . internet service
http://gl.aser.de/ . Planckstraße 7 . D-39104 Magdeburg


Re: Removing duplicates

From
gnews84@hotmail.com (OV)
Date:
Besides the obvious problems that you mention, such as punctuation,
spelling, etc., there are also a few other things that you should keep
in mind:
- A lot of your customers may have moved in the past 8 years. In fact,
I was reading somewhere that about 10% of US residents move, on
average, once a year. Apartment addresses in your database are
especially unreliable.
- Phone numbers are even more unreliable, because not only do they
change when a person moves, they also change when an area code split
occurs. Also, in many cases people may give you a work phone.
- Women change last names when they marry, divorce, etc.

Based on the budget and the number of records in your database you can
do different things to improve your data quality:
If you have the money to spend and the database is large, you can buy
some commercially available products such as First Logic’s ACE
http://www.firstlogic.com/home.asp
or Group One's "Code-1 Plus"
http://www.g1.com/g1intro.html
These products have a database of all addresses in the US standardized
to USPS standards. They can do "soft matches" on addresses for a given
zip code and then standardize the address. Or they can add zip and zip
+ 4 based on your street address and city. Or they can match state with
zip, county with zip, town with zip, etc.
On top of that, you may want to NCOA your database. NCOA stands for
National Change of Address; it is a huge database that the USPS keeps
of every address change people have made in the past (I am not sure
how far back it goes). You can find many 3rd-party vendors that will
NCOA your database. Just go to Google and do a search for "NCOA".
Be aware that the software can cost at least $20,000 and the NCOA may
cost 1 cent per address.
A cheaper option for a one-time job would be to standardize and NCOA
your database through 3rd parties. I am not sure what it will cost,
but if I find more info I will post it here.

The third option for improving your data quality is to use some
inexpensive third party databases to match your zip code with area
code and state. For example look at:
http://www.zipinfo.com/

In order to match customers you have to have an idea of how many
records you have in each zip code. If the number is small (less than
200-300) you can match on zip code, primary range, and last name.
Primary range is the street number at the beginning of the address.
(For example, in "1234 W. Main Street", the primary range is 1234.) In
terms of SQL, you would match on the first four characters of the
Address1 field. If customers are concentrated in one or a few zip
codes, then you probably need to match on the full address1.
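
In SQL, that zip + primary range + name match might look roughly like
this (a sketch; it assumes the table is named customer with an id
column, and it naively takes the first four characters of address1 as
the primary range):

    SELECT a.id, a.name, a.address1, b.id, b.name, b.address1
    FROM customer a, customer b
    WHERE a.id < b.id
      AND a.zip = b.zip
      AND substr(a.address1, 1, 4) = substr(b.address1, 1, 4)
      AND lower(a.name) = lower(b.name);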

Hope this helps.

Tony

PS. Did I tell you that I do consulting on data cleansing and
Merge/Purge? :-)

gnews84@hotmail.com