De-duplicating rows - Mailing list pgsql-sql

From Christophe
Subject De-duplicating rows
Date
Msg-id 87E8E647-4CAD-4491-B687-FA763E0EBE7E@thebuild.com
Whole thread Raw
Responses Re: De-duplicating rows
Re: De-duplicating rows
List pgsql-sql
The Subject: is somewhat imprecise, but here's what I'm trying to do.   
For some reason, my brain is locking up over it.

I'm moving a 7.2 (yes) database to 8.4.  In the table in question, the  
structure is along the lines of:
serial_number    SERIAL, PRIMARY KEYemail        TEXTcreate_date    TIMESTAMPattr1        typeattr2        typeattr3
   type...
 

(The point of the "attr" fields is that there are many more columns  
for each row.)

The new structure removes the "serial_number" field, and uses "email"  
as the primary key, but is otherwise unchanged:
email        TEXT, PRIMARY KEYcreate_date    TIMESTAMPattr1        typeattr2        typeattr3        type...

Now, since this database has been production since 7.2 days, cruft has  
crept in: in particular, there are duplicate email addresses, some  
with mismatched attributes.  The policy decision by the client is that  
the correct row is the one with the earliest timestamp.  (The  
timestamps are widely distributed; it's not the case that there is a  
single timestamp above which all the duplicates live.)  Thus, ideally,  
I want to select exactly one row per "email", picking the row with the  
earliest timestamp in the case that there is more than one row with  
that email.

Any suggestions on how to write such a SELECT?  Of course, I could do  
this with an application against the db, but a single SELECT would be  
great if possible.

TIA!


pgsql-sql by date:

Previous
From: Tim Landscheidt
Date:
Subject: Comparing a string against an XPath result set
Next
From: Richard Huxton
Date:
Subject: Re: De-duplicating rows