Home > mailing lists

De-duplicating rows - Mailing list pgsql-sql

From	Christophe
Subject	De-duplicating rows
Date	July 17, 2009 00:14:54
Msg-id	87E8E647-4CAD-4491-B687-FA763E0EBE7E@thebuild.com Whole thread Raw
Responses	Re: De-duplicating rows Re: De-duplicating rows
List	pgsql-sql

Tree view

The Subject: is somewhat imprecise, but here's what I'm trying to do.   
For some reason, my brain is locking up over it.

I'm moving a 7.2 (yes) database to 8.4.  In the table in question, the  
structure is along the lines of:
serial_number    SERIAL, PRIMARY KEYemail        TEXTcreate_date    TIMESTAMPattr1        typeattr2        typeattr3
   type...
 

(The point of the "attr" fields is that there are many more columns  
for each row.)

The new structure removes the "serial_number" field, and uses "email"  
as the primary key, but is otherwise unchanged:
email        TEXT, PRIMARY KEYcreate_date    TIMESTAMPattr1        typeattr2        typeattr3        type...

Now, since this database has been production since 7.2 days, cruft has  
crept in: in particular, there are duplicate email addresses, some  
with mismatched attributes.  The policy decision by the client is that  
the correct row is the one with the earliest timestamp.  (The  
timestamps are widely distributed; it's not the case that there is a  
single timestamp above which all the duplicates live.)  Thus, ideally,  
I want to select exactly one row per "email", picking the row with the  
earliest timestamp in the case that there is more than one row with  
that email.

Any suggestions on how to write such a SELECT?  Of course, I could do  
this with an application against the db, but a single SELECT would be  
great if possible.

TIA!

pgsql-sql by date:

From: Tim Landscheidt
Date: 16 July 2009, 20:40:14
Subject: Comparing a string against an XPath result set

From: Richard Huxton
Date: 17 July 2009, 04:21:50
Subject: Re: De-duplicating rows

De-duplicating rows - Mailing list pgsql-sql

Previous

Next