Re: Removing duplicates - Mailing list pgsql-sql

From Josh Berkus
Subject Re: Removing duplicates
Date
Msg-id web-810507@davinci.ethosmedia.com
Whole thread Raw
In response to Removing duplicates  (Matthew Hagerty <matthew@brwholesale.com>)
Responses Re: Removing duplicates  (Matthew Hagerty <matthew@brwholesale.com>)
List pgsql-sql
Matt,

> I have a customer database (name, address1, address2, city, state,
>  zip) and I need a query (or two) that will give me a mailing list
>  with the least amount of duplicates possible.  I know that precise
>  matching is not possible, i.e. "P.O. Box 123" will never match "PO
>  Box 123" without some data massaging, but if I can isolate even 50%
>  of any duplicates, that would help greatly.

From the sound of things, you are trying to get out a mailing with theleast number of duplicates you can in a limited
time,rather thantrying to clean up the list for permanent storage.  Chances are, youbought or traded this list from an
outsidesource, yes?
 

In that case, here's some quick de-duplication tricks from myfundraising days:

1. Compare text columns with the punctuation stripped out.  It'samazing how many typographical differences come down to
punctuation.FYI, in Roberto Mello's function catalog (accessable fromhttp://techdocs.postgresql.org/ ) I believe that I
postedsomepunctuation-stripping PL/pgSQL procedures.  For furtherde-duplication, compare only the left 15 characters of
atext field(e.g. SUBSTR(strip_string(address1), 1, 15)), but beware ... this cancause you to weed out some
non-duplicates,such as multiple residentsof large apartment buildings.
 

2. For box office lists, you can use phonetic formulas to comparepersonal names (NOT addresses).  I believe that
Soundexand Metaphoneare included in /contrib these days.  This does not work well onChinese or Southeast Asian names.
 

3. If you got phone numbers along with the addresses, these are anexcellend guage of uniqueness.

> Also, any suggestions on which parameters to check the duplicates
>  for?  My first thoughts were to make sure there were no two
>  addresses the same in the same zip code.  Any insight (or examples)
>  would be greatly appreciated.

Just don't forget that some of the zip codes will probably beerroneous.

-Josh


pgsql-sql by date:

Previous
From: "Dan MacNeil"
Date:
Subject: Re: Removing duplicates
Next
From: panjas51@yahoo.fr (panjas51)
Date:
Subject: [CHALLANGE] Add seconds to a date