Re: Removing duplicates - Mailing list pgsql-sql

From Jeff Self
Subject Re: Removing duplicates
Date
Msg-id 1014740376.12600.20.camel@personnel_test
Whole thread Raw
In response to Removing duplicates  (Matthew Hagerty <matthew@brwholesale.com>)
List pgsql-sql
On Tue, 2002-02-26 at 10:10, Matthew Hagerty wrote:
> Greetings,
> 
> I have a customer database (name, address1, address2, city, state, zip) and 
> I need a query (or two) that will give me a mailing list with the least 
> amount of duplicates possible.  I know that precise matching is not 
> possible, i.e. "P.O. Box 123" will never match "PO Box 123" without some 
> data massaging, but if I can isolate even 50% of any duplicates, that would 
> help greatly.
It depends on which fields you are wanting to check for. If you are
referring to just addresses then, you can do a query using 'Distinct' on
address1. This will give you only one record where address1 is P.O. Box
123. Of course this doesn't help with those that are PO Box 123. You
could always dump your data and write some Perl scripts to alter the
addresses so that they become alike. You could also use the 'Like'
keyword in your sql statement. 
"Select * from customer where address1 like '%Box 123%';

After you run this query, you might want to modify the addresses so that
they are the same.
> Also, any suggestions on which parameters to check the duplicates for?  My 
> first thoughts were to make sure there were no two addresses the same in 
> the same zip code.  Any insight (or examples) would be greatly appreciated.

These constraints should be in the table definitions themselves, which
would prevent these duplications from taking place.

-- 
Jeff Self
Information Technology Analyst
Department of Personnel
City of Newport News
2400 Washington Ave.
Newport News, VA 23607
757-926-6930



pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: Timestamp output
Next
From: Jean-Luc Lachance
Date:
Subject: Re: Sort problem