Home > mailing lists

Re: Removing duplicates - Mailing list pgsql-sql

From	Jeff Self
Subject	Re: Removing duplicates
Date	February 26, 2002 11:21:29
Msg-id	1014740376.12600.20.camel@personnel_test Whole thread
In response to	Removing duplicates (Matthew Hagerty <matthew@brwholesale.com>)
List	pgsql-sql

Tree view

On Tue, 2002-02-26 at 10:10, Matthew Hagerty wrote:
> Greetings,
> 
> I have a customer database (name, address1, address2, city, state, zip) and 
> I need a query (or two) that will give me a mailing list with the least 
> amount of duplicates possible.  I know that precise matching is not 
> possible, i.e. "P.O. Box 123" will never match "PO Box 123" without some 
> data massaging, but if I can isolate even 50% of any duplicates, that would 
> help greatly.
It depends on which fields you are wanting to check for. If you are
referring to just addresses then, you can do a query using 'Distinct' on
address1. This will give you only one record where address1 is P.O. Box
123. Of course this doesn't help with those that are PO Box 123. You
could always dump your data and write some Perl scripts to alter the
addresses so that they become alike. You could also use the 'Like'
keyword in your sql statement. 
"Select * from customer where address1 like '%Box 123%';

After you run this query, you might want to modify the addresses so that
they are the same.
> Also, any suggestions on which parameters to check the duplicates for?  My 
> first thoughts were to make sure there were no two addresses the same in 
> the same zip code.  Any insight (or examples) would be greatly appreciated.

These constraints should be in the table definitions themselves, which
would prevent these duplications from taking place.

-- 
Jeff Self
Information Technology Analyst
Department of Personnel
City of Newport News
2400 Washington Ave.
Newport News, VA 23607
757-926-6930

pgsql-sql by date:

From: Tom Lane
Date: 26 February 2002, 10:57:24
Subject: Re: Timestamp output

From: Jean-Luc Lachance
Date: 26 February 2002, 11:31:57
Subject: Re: Sort problem

Re: Removing duplicates - Mailing list pgsql-sql

Previous

Next